6

I am fairly new to pandas and come from a statistics background, and I am struggling with a conceptual problem: pandas has columns, which contain values. But sometimes values have a special meaning; in statistical programs like SPSS or R these are called "value labels".

Imagine a column rain with two values: 0 (meaning: no rain) and 1 (meaning: raining). Is there a way to assign these labels to those values?

Is there a way to do this in pandas, too? Mainly for plotting and visualisation purposes.

buhtz
Christian Sauer
  • Do you want to store the values as strings or assign some special meaning later? i.e. use a lookup or add a new column that maps the values to human friendly values? Or do you just want this information in the legend of your plot? – EdChum Mar 19 '14 at 08:31
  • 1
    @EdChum Ideally, I want no new column at all - e.g. in SPSS the label is frequently used for displaying data in tables, plots etc., but you can use the numeric value for conditionals. At my work, I often have variables with 30+ different "labels" per column - having the associated strings visible would be a huge help (e.g. avoiding the "what was the meaning of 21?" question) – Christian Sauer Mar 19 '14 at 08:38
  • You could add it as an attribute which is general to Python and not specific to Pandas and access it for your plots see related: http://stackoverflow.com/questions/14688306/adding-meta-information-metadata-to-pandas-dataframe – EdChum Mar 19 '14 at 08:42
  • 1
    That would probably not be used by any normal procedure, but thanks for the suggestion! – Christian Sauer Mar 19 '14 at 09:36

2 Answers

5

There's no need to use a map anymore. Since version 0.15, pandas has had a categorical data type for its columns. The stored data usually takes less space, some operations on it are faster, and you can use labels.

I'm taking an example from the pandas docs:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
# Recast grade as a categorical variable
df["grade"] = df["raw_grade"].astype("category")

df["grade"]

#Gives this:
Out[124]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

You can also rename categories and add missing categories.
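Applied to the `rain` column from the question, renaming and adding categories might look like the sketch below. The column name `rain_label` and the extra `"unknown"` category are made up for illustration; `rename_categories`, `add_categories` and `cat.codes` are existing pandas accessors, but check the behaviour against your pandas version.

```python
import pandas as pd

# Hypothetical data: 0 means "no rain", 1 means "raining"
df = pd.DataFrame({"rain": [0, 1, 1, 0]})

# Convert the numeric column to a categorical; the categories are 0 and 1
rain = df["rain"].astype("category")

# Rename the categories so the column displays labels instead of numbers
df["rain_label"] = rain.cat.rename_categories({0: "no rain", 1: "raining"})

# Add a category that does not occur in the data (illustrative name)
df["rain_label"] = df["rain_label"].cat.add_categories(["unknown"])

# Integer codes behind the labels; these are positions in the category
# list, which only match the original values here because they were 0 and 1
codes = df["rain_label"].cat.codes

# Conditions can be written against the label
rainy = df[df["rain_label"] == "raining"]
```

Note that `cat.codes` returns the position of each category, not the original numeric value, so for codes such as 21 you would keep the original numeric column if you need it in conditions.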

cd98
4

You could have a separate dictionary which maps values to labels:

 d={0:"no rain",1:"raining"}

and then you could access the labelled data by doing

 df.rain_column.apply(lambda x:d[x])
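
A self-contained sketch of the same idea, using `map` as suggested in the comments on this answer. The sample data is invented for illustration; the numeric column stays untouched, so it can still be used in conditions while the labels are looked up only for display.

```python
import pandas as pd

# Hypothetical data for the rain column from the question
df = pd.DataFrame({"rain_column": [0, 1, 1, 0]})

# Dictionary mapping values to labels
d = {0: "no rain", 1: "raining"}

# Look up a label for every value; with map, a value missing from d
# becomes NaN, whereas apply(lambda x: d[x]) would raise a KeyError
labels = df["rain_column"].map(d)

# Conditions still use the numeric values
rainy = df[df["rain_column"] == 1]

# Labels can be used on the fly, e.g. for a frequency table
counts = labels.value_counts()
```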
grasshopper
  • 1
    `map` might be better for this simple case – EdChum Mar 19 '14 at 09:30
  • What is the difference in this case? – grasshopper Mar 19 '14 at 09:34
  • 3
    Only better in terms of simpler syntax: `df.rain_column.map(d)`, and perhaps faster, depending on data size and type. For a dataframe with 100 rows, `apply` is marginally faster (apply 228 us vs map 287 us); for one with 10000 rows, `map` is 26 times faster (map 512 us vs apply 13 ms) – EdChum Mar 19 '14 at 10:10
  • Alright, this makes a lot of sense, since apply is more general purpose than map. – grasshopper Mar 19 '14 at 10:12
  • I will accept cd98's answer, which is better for newer versions of pandas, if that's ok with you. – Christian Sauer Sep 24 '15 at 08:30