476

I have a dataframe df imported from an Excel document like this:

cluster load_date   budget  actual  fixed_price
A   1/1/2014    1000    4000    Y
A   2/1/2014    12000   10000   Y
A   3/1/2014    36000   2000    Y
B   4/1/2014    15000   10000   N
B   4/1/2014    12000   11500   N
B   4/1/2014    90000   11000   N
C   7/1/2014    22000   18000   N
C   8/1/2014    30000   28960   N
C   9/1/2014    53000   51200   N

I want to be able to return the contents of column 1 df['cluster'] as a list, so I can run a for-loop over it, and create an Excel worksheet for every cluster.

Is it also possible to return the contents of a whole column or row to a list? e.g.

list = [], list[column1] or list[df.ix(row1)]
smci
  • 29,564
  • 18
  • 109
  • 144
yoshiserry
  • 17,689
  • 30
  • 72
  • 100
  • 23
    Pandas dataframe columns are a pandas series when you pull them out, which you can then call `.tolist()` on to turn them into a python list – Ben Mar 12 '14 at 03:15
  • 4
    From v0.24 onwards, `.values` will NO LONGER BE the preferred method for accessing underlying numpy arrays. See [this answer](https://stackoverflow.com/a/54324513/4909087). – cs95 Jan 27 '19 at 21:22
  • **Important note:** Converting a Pandas Series to list or NumPy array is often unnecessary, and it almost certainly is in OP's case. – AMC Jan 07 '20 at 18:01
  • 3
    Also, there is no need to read the overly long answers for such a trivial question. [`df.to_numpy().tolist()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html) should be fine for most use cases. – AMC Jan 07 '20 at 18:50
  • 2
    Simply typecast using `list(x)` – Pe Dro Jun 19 '20 at 07:03

7 Answers7

796

Pandas DataFrame columns are Pandas Series when you pull them out, which you can then call x.tolist() on to turn them into a Python list. Alternatively you cast it with list(x).

import pandas as pd

data_dict = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
             'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(data_dict)

print(f"DataFrame:\n{df}\n")
print(f"column types:\n{df.dtypes}")

col_one_list = df['one'].tolist()

col_one_arr = df['one'].to_numpy()

print(f"\ncol_one_list:\n{col_one_list}\ntype:{type(col_one_list)}")
print(f"\ncol_one_arr:\n{col_one_arr}\ntype:{type(col_one_arr)}")

Output:

DataFrame:
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

column types:
one    float64
two      int64
dtype: object

col_one_list:
[1.0, 2.0, 3.0, nan]
type:<class 'list'>

col_one_arr:
[ 1.  2.  3. nan]
type:<class 'numpy.ndarray'>
samlima
  • 144
  • 9
Ben
  • 10,992
  • 3
  • 29
  • 62
  • 35
    I can't get my head around the style of the docs, because it's almost always straight syntax, where as I need syntax and example. E.g. Syntax would be to create a set: use the set keyword, and a list: Accompanying example: alist = df.cluster.tolist(). Until pandas is written in this way I will struggle. it's getting there, there are some examples now, but not for every method. – yoshiserry Mar 12 '14 at 04:02
  • Thanks @Ben, great Answer! Can you tell me about the Dataframe method, Ive never seen that before... seems like you are converting a dinctionary to a df? df = DataFrame(d)? – yoshiserry Mar 12 '14 at 04:14
  • One of the default ways to make a dataframe is to pass it a list of dictionaries with matching keys. – Ben Mar 12 '14 at 04:15
  • Is it also possible to make the series into a list using the set command in one line? I've been able to do it with 4: c = data.cluster.tolist() u = set(c) for i in u: print i – yoshiserry Mar 12 '14 at 04:17
  • extending the example above you could add `print(set(df['one'].tolist()))` to give you the unique values. BUT you might be better off using the query type functions of the dataframe, e.g. `filtered = df[df["one"] == 1]` But this might be gatting to the point where it needs it's own question. – Ben Mar 12 '14 at 04:30
  • That's actually where I was leading to all along, but wanted to get the answer piece by piece! Thanks! To expand! I was actually hoping to use the set function you described to get a,b,c unique values from the clusters column. Then use for loops in conjunction with the dataframe query type functions: filtered = df[df["cluster"] == unique value 1 etc from for loop]. Maybe I should edit the question? / create another one for the filtering part? – yoshiserry Mar 12 '14 at 04:34
  • I intended to use this set function you showed me to get unique values for all columns, and then get different dataframes based on: cluster value, e.g. A, each different loaddate, e.g. 2014-02-01, and then fixed prixed (yes or no). I was going to do this using nested for loops – yoshiserry Mar 12 '14 at 04:37
  • dataframes also have a groupby method that might be helpful, that would do both the unique values, and the filtering, but they are a bit of a mystery to me. Make a new question and someone might know how to use them. – Ben Mar 12 '14 at 04:48
  • 2
    @yoshiserry most of the common functions now have example usage in their documentation, below the syntax and argument listing. You can also see 15 minutes to pandas for more beginner level examples. – cs95 Jun 05 '19 at 05:55
  • 2
    @Ben I hadn't seen you are still active on SO, I wanted to mention that I submitted a decently sized edit to this answer, so let me know what you think :) – AMC Jan 07 '20 at 18:21
  • If you use `df['column'].tolist()`, be sure to _not_ use double brackets like this: `df[['column']].tolist()`. Since it will then return a dataframe object for which `.tolist()` is not available. – ThomasDeLange Mar 23 '22 at 14:04
72

This returns a numpy array:

arr = df["cluster"].to_numpy()

This returns a numpy array of unique values:

unique_arr = df["cluster"].unique()

You can also use numpy to get the unique values, although there are differences between the two methods:

arr = df["cluster"].to_numpy()
unique_arr = np.unique(arr)
AMC
  • 2,535
  • 7
  • 12
  • 34
Anirudh Bandi
  • 1,121
  • 10
  • 20
  • What are the differences between the two methods? – Pro Q Jun 29 '20 at 20:36
  • 1
    @ProQ pandas unique method does not sort unlike np.unique() therefore it is faster. For more info look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html and https://numpy.org/devdocs/reference/generated/numpy.unique.html. The docs are pretty solid for both the methods. – Anirudh Bandi Jun 30 '20 at 00:07
  • But the question is about a python list? – MrR Oct 23 '21 at 23:06
17

Example conversion:

Numpy Array -> Panda Data Frame -> List from one Panda Column

Numpy Array

data = np.array([[10,20,30], [20,30,60], [30,60,90]])

Convert numpy array into Panda data frame

dataPd = pd.DataFrame(data = data)
    
print(dataPd)
0   1   2
0  10  20  30
1  20  30  60
2  30  60  90

Convert one Panda column to list

pdToList = list(dataPd['2'])

Hrvoje
  • 10,368
  • 5
  • 67
  • 78
  • 2
    Why show the array creation code twice, as if it were an important part of the solution? Why even create that array at all, in fact? Isn't `df = pd.DataFrame(data=[[10, 20, 30], [20, 30, 60], [30, 60, 90]])` more straightforward? Also, note the variable name and whitespace which follow Python style conventions. _Iterate over list as a proof_ What does that prove, exactly? That it's a list? – AMC Jan 07 '20 at 18:25
  • Agreed - numpy array stuff is irrelevant to the question – MrR Oct 23 '21 at 15:39
6

As this question attained a lot of attention and there are several ways to fulfill your task, let me present several options.

Those are all one-liners by the way ;)

Starting with:

df
  cluster load_date budget actual fixed_price
0       A  1/1/2014   1000   4000           Y
1       A  2/1/2014  12000  10000           Y
2       A  3/1/2014  36000   2000           Y
3       B  4/1/2014  15000  10000           N
4       B  4/1/2014  12000  11500           N
5       B  4/1/2014  90000  11000           N
6       C  7/1/2014  22000  18000           N
7       C  8/1/2014  30000  28960           N
8       C  9/1/2014  53000  51200           N

Overview of potential operations:

ser_aggCol (collapse each column to a list)
cluster          [A, A, A, B, B, B, C, C, C]
load_date      [1/1/2014, 2/1/2014, 3/1/2...
budget         [1000, 12000, 36000, 15000...
actual         [4000, 10000, 2000, 10000,...
fixed_price      [Y, Y, Y, N, N, N, N, N, N]
dtype: object


ser_aggRows (collapse each row to a list)
0     [A, 1/1/2014, 1000, 4000, Y]
1    [A, 2/1/2014, 12000, 10000...
2    [A, 3/1/2014, 36000, 2000, Y]
3    [B, 4/1/2014, 15000, 10000...
4    [B, 4/1/2014, 12000, 11500...
5    [B, 4/1/2014, 90000, 11000...
6    [C, 7/1/2014, 22000, 18000...
7    [C, 8/1/2014, 30000, 28960...
8    [C, 9/1/2014, 53000, 51200...
dtype: object


df_gr (here you get lists for each cluster)
                             load_date                 budget                 actual fixed_price
cluster                                                                                         
A        [1/1/2014, 2/1/2014, 3/1/2...   [1000, 12000, 36000]    [4000, 10000, 2000]   [Y, Y, Y]
B        [4/1/2014, 4/1/2014, 4/1/2...  [15000, 12000, 90000]  [10000, 11500, 11000]   [N, N, N]
C        [7/1/2014, 8/1/2014, 9/1/2...  [22000, 30000, 53000]  [18000, 28960, 51200]   [N, N, N]


a list of separate dataframes for each cluster

df for cluster A
  cluster load_date budget actual fixed_price
0       A  1/1/2014   1000   4000           Y
1       A  2/1/2014  12000  10000           Y
2       A  3/1/2014  36000   2000           Y

df for cluster B
  cluster load_date budget actual fixed_price
3       B  4/1/2014  15000  10000           N
4       B  4/1/2014  12000  11500           N
5       B  4/1/2014  90000  11000           N

df for cluster C
  cluster load_date budget actual fixed_price
6       C  7/1/2014  22000  18000           N
7       C  8/1/2014  30000  28960           N
8       C  9/1/2014  53000  51200           N

just the values of column load_date
0    1/1/2014
1    2/1/2014
2    3/1/2014
3    4/1/2014
4    4/1/2014
5    4/1/2014
6    7/1/2014
7    8/1/2014
8    9/1/2014
Name: load_date, dtype: object


just the values of column number 2
0     1000
1    12000
2    36000
3    15000
4    12000
5    90000
6    22000
7    30000
8    53000
Name: budget, dtype: object


just the values of row number 7
cluster               C
load_date      8/1/2014
budget            30000
actual            28960
fixed_price           N
Name: 7, dtype: object


============================== JUST FOR COMPLETENESS ==============================


you can convert a series to a list
['C', '8/1/2014', '30000', '28960', 'N']
<class 'list'>


you can convert a dataframe to a nested list
[['A', '1/1/2014', '1000', '4000', 'Y'], ['A', '2/1/2014', '12000', '10000', 'Y'], ['A', '3/1/2014', '36000', '2000', 'Y'], ['B', '4/1/2014', '15000', '10000', 'N'], ['B', '4/1/2014', '12000', '11500', 'N'], ['B', '4/1/2014', '90000', '11000', 'N'], ['C', '7/1/2014', '22000', '18000', 'N'], ['C', '8/1/2014', '30000', '28960', 'N'], ['C', '9/1/2014', '53000', '51200', 'N']]
<class 'list'>

the content of a dataframe can be accessed as a numpy.ndarray
[['A' '1/1/2014' '1000' '4000' 'Y']
 ['A' '2/1/2014' '12000' '10000' 'Y']
 ['A' '3/1/2014' '36000' '2000' 'Y']
 ['B' '4/1/2014' '15000' '10000' 'N']
 ['B' '4/1/2014' '12000' '11500' 'N']
 ['B' '4/1/2014' '90000' '11000' 'N']
 ['C' '7/1/2014' '22000' '18000' 'N']
 ['C' '8/1/2014' '30000' '28960' 'N']
 ['C' '9/1/2014' '53000' '51200' 'N']]
<class 'numpy.ndarray'>

code:

# prefix ser refers to pd.Series object
# prefix df refers to pd.DataFrame object
# prefix lst refers to list object

import pandas as pd
import numpy as np

df=pd.DataFrame([
        ['A',   '1/1/2014',    '1000',    '4000',    'Y'],
        ['A',   '2/1/2014',    '12000',   '10000',   'Y'],
        ['A',   '3/1/2014',    '36000',   '2000',    'Y'],
        ['B',   '4/1/2014',    '15000',   '10000',   'N'],
        ['B',   '4/1/2014',    '12000',   '11500',   'N'],
        ['B',   '4/1/2014',    '90000',   '11000',   'N'],
        ['C',   '7/1/2014',    '22000',   '18000',   'N'],
        ['C',   '8/1/2014',    '30000',   '28960',   'N'],
        ['C',   '9/1/2014',    '53000',   '51200',   'N']
        ], columns=['cluster', 'load_date',   'budget',  'actual',  'fixed_price'])
print('df',df, sep='\n', end='\n\n')

ser_aggCol=df.aggregate(lambda x: [x.tolist()], axis=0).map(lambda x:x[0])
print('ser_aggCol (collapse each column to a list)',ser_aggCol, sep='\n', end='\n\n\n')

ser_aggRows=pd.Series(df.values.tolist()) 
print('ser_aggRows (collapse each row to a list)',ser_aggRows, sep='\n', end='\n\n\n')

df_gr=df.groupby('cluster').agg(lambda x: list(x))
print('df_gr (here you get lists for each cluster)',df_gr, sep='\n', end='\n\n\n')

lst_dfFiltGr=[ df.loc[df['cluster']==val,:] for val in df['cluster'].unique() ]
print('a list of separate dataframes for each cluster', sep='\n', end='\n\n')
for dfTmp in lst_dfFiltGr:
    print('df for cluster '+str(dfTmp.loc[dfTmp.index[0],'cluster']),dfTmp, sep='\n', end='\n\n')

ser_singleColLD=df.loc[:,'load_date']
print('just the values of column load_date',ser_singleColLD, sep='\n', end='\n\n\n')

ser_singleCol2=df.iloc[:,2]
print('just the values of column number 2',ser_singleCol2, sep='\n', end='\n\n\n')

ser_singleRow7=df.iloc[7,:]
print('just the values of row number 7',ser_singleRow7, sep='\n', end='\n\n\n')

print('='*30+' JUST FOR COMPLETENESS '+'='*30, end='\n\n\n')

lst_fromSer=ser_singleRow7.tolist()
print('you can convert a series to a list',lst_fromSer, type(lst_fromSer), sep='\n', end='\n\n\n')

lst_fromDf=df.values.tolist()
print('you can convert a dataframe to a nested list',lst_fromDf, type(lst_fromDf), sep='\n', end='\n\n')

arr_fromDf=df.values
print('the content of a dataframe can be accessed as a numpy.ndarray',arr_fromDf, type(arr_fromDf), sep='\n', end='\n\n')

as pointed out by cs95 other methods should be preferred over pandas .values attribute from pandas version 0.24 on see here. I use it here, because most people will (by 2019) still have an older version, which does not support the new recommendations. You can check your version with print(pd.__version__)

marc_s
  • 704,970
  • 168
  • 1,303
  • 1,425
Markus Dutschke
  • 7,177
  • 2
  • 46
  • 45
5

If your column will only have one value something like pd.series.tolist() will produce an error. To guarantee that it will work for all cases, use the code below:

(
    df
        .filter(['column_name'])
        .values
        .reshape(1, -1)
        .ravel()
        .tolist()
)
Ramin Melikov
  • 869
  • 8
  • 12
  • Gosh all this needed for the one value case? Catch that case with an except clause? – MrR Oct 23 '21 at 15:40
  • What's wrong with: `list(df['column_name'])` - it will work with one item. – MrR Oct 23 '21 at 22:56
  • @MrR If you need to chain your methods, then wrapping it into a list by `list()` will be breaking the consistency. Also, code is clearer this way as we're seeing step-by-step what is happening and we can at any point put a `#` in front of each or multiple lines to modify code and see how each line changes the `df`. – Ramin Melikov Mar 18 '22 at 20:13
0

Assuming the name of the dataframe after reading the excel sheet is df, take an empty list (e.g. dataList), iterate through the dataframe row by row and append to your empty list like-

dataList = [] #empty list
for index, row in df.iterrows(): 
    mylist = [row.cluster, row.load_date, row.budget, row.actual, row.fixed_price]
    dataList.append(mylist)

Or,

dataList = [] #empty list
for row in df.itertuples(): 
    mylist = [row.cluster, row.load_date, row.budget, row.actual, row.fixed_price]
    dataList.append(mylist)

No, if you print the dataList, you will get each rows as a list in the dataList.

Natasha
  • 6,303
  • 1
  • 31
  • 52
  • Variable and function names should follow the `lower_case_with_underscores` style. What advantage does this solution have over the existing ones, exactly? Also, I really discourage the use of attribute-style access on Series and DataFrames. – AMC Jan 07 '20 at 18:45
-3
 amount = list()
    for col in df.columns:
        val = list(df[col])
        for v in val:
            amount.append(v)
kamran kausar
  • 3,467
  • 21
  • 17