24

I am working with pandas dataframes that are essentially time series like this:

             level
Date              
1976-01-01  409.67
1976-02-01  409.58
1976-03-01  409.66
…

What I want to have, is multiple indexes/headers for the level column, like so:

           Station1                   #Name of the datasource
           43.1977317,-4.6473648,5    #Lat/Lon of the source
           Precip                     #Type of data
Date              
1976-01-01  409.67
1976-02-01  409.58
1976-03-01  409.66
…

So essentially I am searching for something like Mydata.columns.level1 = ['Station1'], Mydata.columns.level2 = [Lat,Lon], Mydata.columns.level3 = ['Precip'].

Reason being that a single location can have multiple datasets, and that I want to be able to pick either all data from one location, or all data of a certain type from all locations, from a subsequent merged, big dataframe.

I can set up an example dataframe from the pandas documentation, and test my selection, but with my real data, I need a different way to set the indexes as in the example.

Example:

Built a small dataframe

header = [np.array(['location','location','location','location2','location2','location2']), 
np.array(['S1','S2','S3','S1','S2','S3'])] 
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns = header )   

df
    location                      location2                    
         S1        S2        S3         S1        S2        S3
a -1.469932 -1.544511 -1.373463  -0.317262  0.024832 -0.641000
b  0.047170 -0.339423  1.351253   0.601172 -1.607339  0.035932
c -0.257479  1.140829  0.188291  -0.242490  1.019315 -1.163429
d  0.832949  0.098170 -0.818513  -0.070383  0.557419 -0.489839
e -0.628549 -0.158419  0.366167  -2.319316 -0.474897 -0.319549

Pick datatype or location:

df.loc(axis=1)[:,'S1']

   location  location2
         S1         S1
a -1.469932  -0.317262
b  0.047170   0.601172
c -0.257479  -0.242490
d  0.832949  -0.070383
e -0.628549  -2.319316

df['location']

         S1        S2        S3
a -1.469932 -1.544511 -1.373463
b  0.047170 -0.339423  1.351253
c -0.257479  1.140829  0.188291
d  0.832949  0.098170 -0.818513
e -0.628549 -0.158419  0.366167

Or am I just looking for the wrong terminology? Because 90% of all examples in the documentation, and the questions here only treat the vertical "stuff" (dates or abcde in my case) as index, and a quick df.index.values on my test data also just gets me the vertical array(['a', 'b', 'c', 'd', 'e'], dtype=object).

JC_CL
  • 1,936
  • 4
  • 19
  • 34
  • 1
    What is exactly your question? Starting from your example dataframe, what do you want to obtain? – joris Sep 03 '15 at 10:07
  • I am aksing how give a dataframe multiple indexes. Instead of a single `level`, I need strings for `Location`, `LatLon` and `Type`. The example dataframes are just to illustrate why I need multiple indexes. – JC_CL Sep 03 '15 at 10:45
  • Do you mean something like `df.loc(axis=1)['location','S1']` (or `df['location','S1']`)? – joris Sep 03 '15 at 11:02
  • No, the "picking stuff" part I have figured out. This is just for illustrating the "why"! I am looking for a way to turn my real data into a form where I could use those selections. I changed the qeustion a bit. I am looking for a way to set multiple columns with `df.columns = ['something']` – JC_CL Sep 03 '15 at 11:14
  • 2
    Ah, you want to create the multi-indexed columns. You can do something like `df.columns = pd.MultiIndex.from_tuples([('Station1', 'Precip', ..), ..])` – joris Sep 03 '15 at 11:23
  • Yes. That (and @neoFox answer below) seems to be what I am looking for. – JC_CL Sep 03 '15 at 11:44

1 Answers1

40

You can use multiIndex to give multiple columns with names for each level. Use MultiIndex.from_product() to make multiIndex from cartesian products of multiple iterables.

header = pd.MultiIndex.from_product([['location1','location2'],
                                     ['S1','S2','S3']],
                                    names=['loc','S'])
df = pd.DataFrame(np.random.randn(5, 6), 
                  index=['a','b','c','d','e'], 
                  columns=header)

Two levels will be loc and S.

df
loc location1                     location2                    
S          S1        S2        S3        S1        S2        S3
a   -1.245988  0.858071 -1.433669  0.105300 -0.630531 -0.148113
b    1.132016  0.318813  0.949564 -0.349722 -0.904325  0.443206
c   -0.017991  0.032925  0.274248  0.326454 -0.108982  0.567472
d    2.363533 -1.676141  0.562893  0.967338 -1.071719 -0.321113
e    1.921324  0.110705  0.023244 -0.432196  0.172972 -0.50368

Now you can use xs to slice the dateframe based on levels.

df.xs('location1',level='loc',axis=1)

S        S1        S2        S3
a -1.245988  0.858071 -1.433669
b  1.132016  0.318813  0.949564
c -0.017991  0.032925  0.274248
d  2.363533 -1.676141  0.562893
e  1.921324  0.110705  0.02324

df.xs('S1',level='S',axis=1)

loc  location1  location2
a    -1.245988   0.105300
b     1.132016  -0.349722
c    -0.017991   0.326454
d     2.363533   0.967338
e     1.921324  -0.43219
Mel
  • 5,460
  • 10
  • 38
  • 41
kanatti
  • 705
  • 7
  • 10
  • That seems to be what I am looking for. I also like the added names for the columns! But when playing around with my real data, it behaves erratically. At first a `df1.columns = header` (Which should be simply changing the header of an existing dataframe?!) also changed the headers of `df2` and `df3`, copied from the same original df, and now after `clear`ing ipython, `df1.columns = header` doesn't do anything. – JC_CL Sep 03 '15 at 11:49
  • 4
    if you used `df1=df df2=df` for copying , it wont create a copy. It will only create a variables df1 and df2 that refers to df. So any changes made to df2 will also change df1 and vice versa. To explicitly make a copy, use `df1=df.copy() df2=df.copy()` . – kanatti Sep 03 '15 at 12:01
  • Ah, crap… my favorite mistake when playing in ipython. Now it works, and your added names and slicing examples are much better than my first approach. Thanks! – JC_CL Sep 03 '15 at 12:21