
I am following this tutorial: GitHub Link

Scroll down to the section titled "Exercise: Select the most-reviewd beers" (Ctrl+F for that exact text).

The dataframe is multi-indexed.

To select the most-reviewed beers:

top_beers = df['beer_id'].value_counts().head(10).index
reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]

My question is about how IndexSlice is used here. Why can you omit the colon after top_beers and the code still runs? In other words, why is the following form, with the trailing colon, not required?

reviews.loc[pd.IndexSlice[:, top_beers, :], ['beer_name', 'beer_style']] 

There are three index levels: profile_name, beer_id and time. Why does pd.IndexSlice[:, top_beers] work without specifying what to do with the time level?

Cheng
  • That's what the `:` operator does. You are filtering by only one of the three columns of the hierarchical index. The other two (the ones using `:`) can take any value. You can think of `:` as a filter that matches `True` for any value. – Gustavo Bezerra May 21 '17 at 07:36
  • @GustavoBezerra the problem is that even without the third `:` the code still works. `reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]` works even without the third `:' – Cheng May 21 '17 at 12:18
  • top_beers is list-like. You're filtering the second index level, beer_id, by top_beers. The other two levels default to all values. If you want to slice by range, use `slice(a, b)` – Golden Lion Feb 16 '21 at 17:03

2 Answers

To complement the previous answer, let me explain how pd.IndexSlice works and why it is useful.

There is not much to say about its implementation. As you can read in the source, it just does the following:

class IndexSlice(object):
    def __getitem__(self, arg):
        return arg

From this we see that pd.IndexSlice only forwards the arguments that __getitem__ has received. Looks pretty stupid, doesn't it? However, it actually does something.

As you certainly know already, obj.__getitem__(arg) is called when you access an object obj through its bracket operator obj[arg]. For sequence-type objects, arg can be either an integer or a slice object. We rarely construct slices ourselves. Rather, we use the colon syntax : for this purpose, e.g. obj[0:5].

And here comes the point. The Python interpreter converts this colon syntax into slice objects before calling the object's __getitem__(arg) method. Therefore, the return value of IndexSlice.__getitem__() will actually be a slice, an integer (if no : was used), or a tuple of these (if multiple arguments are passed). In summary, the only purpose of IndexSlice is that we don't have to construct the slices on our own. This behavior is particularly useful for pd.DataFrame.loc.

Let's first have a look at the following examples:

import pandas as pd
idx = pd.IndexSlice
print(idx[0])               # 0
print(idx[0,'a'])           # (0, 'a')
print(idx[:])               # slice(None, None, None)
print(idx[0:3])             # slice(0, 3, None)
print(idx[0.1:2.3])         # slice(0.1, 2.3, None)
print(idx[0:3,'a':'c'])     # (slice(0, 3, None), slice('a', 'c', None))

We observe that every use of the colon : is converted into a slice object. If multiple arguments are passed to the index operator, they are bundled into an n-tuple.

To demonstrate how this could be useful for a pandas data-frame df with a multi-level index, let's have a look at the following.

# A sample table with three-level row-index
# and single-level column index.
import numpy as np
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi),2]), 
                  index=mi, columns=['col1', 'col2'])

# Select column 'col1' for all rows.
df.loc[:,'col1']            # pd.Series

# Note: in the above example, the returned value has type
# pd.Series, because only one column is returned. One can
# force the returned object to be a data-frame:
df.loc[:,['col1']]          # pd.DataFrame, or
df.loc[:,'col1'].to_frame() # pd.DataFrame

# Select all rows with top-level values 0 to 3. Label-based
# slices in .loc include both endpoints.
df.loc[0:3, 'col1']

# If we want to slice on multiple index levels, we somehow
# need to pass several slices at once. The following, however,
# is a SyntaxError because the colon syntax cannot be used
# inside a list literal (left commented out so the rest runs):
# df.loc[[0:3, 'a':'c'], 'col1']

# The following is valid python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']

# Here is why pd.IndexSlice is useful. It helps
# to create a slice that makes use of two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1'] 

# We can expand the slice specification by a third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1'] 

# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series

# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does 
# not add any information about the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1']    # pd.Series

# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :]    # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :]       # pd.DataFrame

In summary, pd.IndexSlice helps to improve readability when specifying slices for row and column indices.
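As a quick sanity check of the equivalences claimed in the example, here is a minimal sketch. It rebuilds a sample frame like the one above; the seed and the names with_colon, without_colon and explicit are my own choices for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Sample frame with a three-level row index (same shape as above).
mi = pd.MultiIndex.from_product(
    [range(10), list("abcdef"), ["I", "II", "III", "IV"]]
)
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame(rng.random((len(mi), 2)), index=mi, columns=["col1", "col2"])

# idx[...] builds the same tuple of slices we could write by hand.
assert idx[0:3, "a":"c"] == (slice(0, 3), slice("a", "c"))

with_colon = df.loc[idx[0:3, "a":"c", :], "col1"]
without_colon = df.loc[idx[0:3, "a":"c"], "col1"]
explicit = df.loc[(slice(0, 3), slice("a", "c")), "col1"]

# All three spellings select the same rows.
same = with_colon.equals(without_colon) and with_colon.equals(explicit)
n_rows = len(with_colon)  # 4 top-level values x 3 letters x 4 numerals = 48
```

Note that both endpoints of each label slice are included, which is why the top level contributes four values (0, 1, 2, 3) rather than three.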

What pandas then does with these slices is a different story. It essentially selects rows/columns, starting from the topmost index-level and reduces the selection when going further down the levels, depending on how many levels have been specified. pd.DataFrame.loc is an object with its own __getitem__() function that does all this.
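One way to picture this level-by-level narrowing is as a boolean mask built from the individual index levels. The sketch below is only an illustration of the semantics, not how pandas implements it internally; the level names num, letter and roman are made up for the example.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Hypothetical three-level index with named levels.
mi = pd.MultiIndex.from_product(
    [range(4), list("abc"), ["I", "II"]], names=["num", "letter", "roman"]
)
df = pd.DataFrame({"col1": np.arange(len(mi))}, index=mi)

# Slice on level 0, pick a list of labels on level 1, leave level 2 unspecified.
picked = df.loc[idx[1:2, ["a", "c"]], :]

# The same selection written as one condition per specified level.
num = df.index.get_level_values("num")
letter = df.index.get_level_values("letter")
mask = (num >= 1) & (num <= 2) & letter.isin(["a", "c"])
by_mask = df[mask]

matches = picked.equals(by_mask)
```

The unspecified third level places no condition on the mask, which is the same as writing : for it.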

As you already pointed out in one of your comments, pandas seemingly behaves weirdly in some special cases. The two examples you mentioned should evaluate to the same result. However, pandas treats them differently internally.

# This will work.
reviews.loc[idx[top_reviewers,        99, :], ['beer_name', 'brewer_id']]
# This will fail with TypeError "unhashable type: 'Index'".
reviews.loc[idx[top_reviewers,        99]   , ['beer_name', 'brewer_id']]
# This fixes the problem. (pd.Index is not hashable, a tuple is.
# However, the problem matters only with the second expression.)
reviews.loc[idx[tuple(top_reviewers), 99]   , ['beer_name', 'brewer_id']]

Admittedly, the difference is subtle.
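Converting the pd.Index to a plain list is another way to avoid the hashing issue. Below is a minimal sketch on a made-up miniature of the reviews table (all names and values are hypothetical). I have not checked which pandas versions still raise for the unconverted two-element form, so the sketch uses the list in both spellings.

```python
import pandas as pd

idx = pd.IndexSlice

# Hypothetical stand-in for the tutorial's reviews data.
raw = pd.DataFrame({
    "profile_name": ["ann", "ann", "ann", "bob", "bob", "cat"],
    "beer_id": [99, 1, 2, 99, 1, 99],
    "time": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-01-03",
        "2017-01-04", "2017-01-05", "2017-01-06",
    ]),
    "beer_name": ["Niner", "Alpha", "Bravo", "Niner", "Alpha", "Niner"],
    "brewer_id": [9, 1, 2, 9, 1, 9],
})
reviews = raw.set_index(["profile_name", "beer_id", "time"]).sort_index()

# The two most active reviewers (ann: 3 reviews, bob: 2), as a plain list.
top_reviewers = raw["profile_name"].value_counts().head(2).index
names = top_reviewers.tolist()

cols = ["beer_name", "brewer_id"]
with_colon = reviews.loc[idx[names, 99, :], cols]
without_colon = reviews.loc[idx[names, 99], cols]

same = with_colon.equals(without_colon)
```

With a list in the first position, both spellings go through the same multi-level selection and return the reviews of beer 99 by the listed reviewers.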

normanius
  • What if the indices were float numbers? How would it work then? – arash Jan 11 '20 at 13:04
  • @arash: The same. `slice()` is agnostic of datatypes. It just bundles information about `start`, `stop` and `step`. How a particular slice (e.g. `slice(0.1, 2.3, 4.5)`) is interpreted depends on the object receiving the slice. For a `df = pd.DataFrame([[1,2,3],[4,5,6]], columns=[0.1,2.3,4.5])` you can access all columns by `idx[0.1:4.5]`, which is consistent with the behavior for other index types. And it's not too surprising that `pandas` raises an error for `idx[0.1:4.5:2.3]` because it cannot make sense of a float-type step. – normanius Jan 11 '20 at 13:58
  • @arash See maybe also [this answer](https://stackoverflow.com/a/3912107/3388962) – normanius Jan 11 '20 at 13:58

Pandas only requires you to specify enough levels of the MultiIndex to remove an ambiguity. Since you're slicing on the 2nd level, you need the first : to say I'm not filtering on this level.

Any additional levels not specified are returned in their entirety, so equivalent to a : on each of those levels.
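To make this concrete for the case in the question, here is a small sketch on a made-up miniature of the reviews table (the data and the names short and full are hypothetical). It converts top_beers to a list, which is an assumption on my part to keep the example independent of how a pd.Index is handled inside the tuple.

```python
import pandas as pd

idx = pd.IndexSlice

# Hypothetical stand-in for the tutorial's reviews data.
raw = pd.DataFrame({
    "profile_name": ["ann", "ann", "bob", "bob", "cat", "cat"],
    "beer_id": [1, 2, 1, 3, 1, 2],
    "time": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-01-03",
        "2017-01-04", "2017-01-05", "2017-01-06",
    ]),
    "beer_name": ["Alpha", "Bravo", "Alpha", "Charlie", "Alpha", "Bravo"],
})
reviews = raw.set_index(["profile_name", "beer_id", "time"]).sort_index()

# The two most-reviewed beers: beer 1 (3 reviews) and beer 2 (2 reviews).
top_beers = raw["beer_id"].value_counts().head(2).index.tolist()

# Level 0 needs the explicit ':' placeholder; level 2 can be left out.
short = reviews.loc[idx[:, top_beers], ["beer_name"]]
full = reviews.loc[idx[:, top_beers, :], ["beer_name"]]

same = short.equals(full)
```

The leading : cannot be dropped because positions in the tuple map to index levels in order; without it, top_beers would be matched against profile_name.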

TomAugspurger
  • If that is the case, then why can't I remove the colon from this line within the same tutorial: `reviews.loc[pd.IndexSlice[top_reviewers, 99,:], ['beer_name', 'brewer_id']]`? If I remove the colon and comma after `99`, I get an `unhashable type: 'Index'` error – Cheng May 22 '17 at 00:15
  • Not sure off the top of my head. Based on the error message, about `Index` being unhashable, it's possible it's taking a different indexing path. You could open an issue on github with a simpler example and we'll take a look. – TomAugspurger May 24 '17 at 13:11
  • @Cheng: The problem is that `top_reviewers` is of type `pd.Index`, which apparently is not hashable out of the box. To fix this, you could transform it into a list first (which can be further transformed into a hashable object). So the following will work: `reviews.loc[pd.IndexSlice[top_reviewers.tolist(), 99], ['beer_name', 'brewer_id']]` – normanius Oct 30 '18 at 15:12
  • @Cheng But it's true that you discovered a small inconsistency in the way pandas processes slices: `top_reviewers` in `pd.IndexSlice[top_reviewers, 99, :]` and `pd.IndexSlice[top_reviewers, 99]` is not treated in exactly the same way, the latter leading to an error, while the former does not. – normanius Oct 30 '18 at 15:17