I am a beginner, starting to use a new .dta data set in Stata.
It has variables like schooling and gender with storage type 'byte', which implies they are integers.

However, the individual entries are displayed as "highschool" or "female". When I then try to do something like

drop schooling if gender==female

it says female not found.

And when I try it this way

drop schooling if gender=="female"

it says type mismatch.

I'm confused as to what the actual problem is. I tried to convert the variables into strings with

gen state1 = string(state)

which only yielded a state1 variable full of "1" entries.

  • Off-topic here as focused on details in Stata. Please see advice in the Help Center on off-topic questions. – Nick Cox Feb 23 '20 at 12:15
  • Thanks for pointing this out. Do I see correctly, that such a question would be on-topic on stack overflow? – Tototulbi Feb 23 '20 at 15:29
  • I am active there too. I won’t usually recommend questions for migration from CV to SE if they don’t match good standards there, by being self-contained in terms of data and code. MCVE is one search term. – Nick Cox Feb 23 '20 at 15:53
  • .... to SO ... not SE. – Nick Cox Feb 23 '20 at 17:37

1 Answer

This error occurs because the data in your variable gender is stored as a byte, which I'm guessing takes only the values 0 and 1. When you look at the variable in the data editor, you might see female and male, but this display is produced by a so-called value label, which maps each stored number to a string (e.g. 0 -> male and 1 -> female). So the correct way to perform your operations would be

drop schooling if gender==1

See these pages on Value labels and Data types in Stata
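To make the difference between stored values and value labels concrete, here is a minimal sketch on toy data. The variable contents, the codes 0/1, and the label name gender_lbl are assumptions for illustration and are not taken from the asker's file. It uses count instead of drop so that nothing is deleted.

```stata
* Toy data: all values and the label name gender_lbl are hypothetical
clear
input byte gender byte schooling
0 1
1 2
1 1
0 2
1 2
end

* Attach a value label; the stored values remain numeric
label define gender_lbl 0 "male" 1 "female"
label values gender gender_lbl

* Compare against the stored number ...
count if gender == 1

* ... or look the number up through the value label name
count if gender == "female":gender_lbl

* By contrast, gender == "female" gives a type mismatch,
* because gender is numeric and "female" is a string
```

Both count commands should report 3 on this toy data. If the goal is to remove observations, `drop if gender == 1` is the form I would expect to use; to blank out schooling for those rows instead, something like `replace schooling = . if gender == 1` would be an option.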

Joe
  • Thank you for your advice! This makes sense and I'll look into the links you provided. I just want to note, that the variables stored as byte are not binary. I have, for example, a "byte-variable" for "state" that takes many different values. – Tototulbi Feb 23 '20 at 10:20
  • That makes sense, I was just trying to emphasize that there is a difference between the actual data stored in the byte and the value label. If you want to see which value is assigned to which label, you can type label list gender. Usually, variables that have a value label assigned to them will appear in blue in your data editor/browser. – Joe Feb 23 '20 at 10:23
  • [Sorry, I saw your reply just after writing this.] I checked this for a variable that is binary (e.g. married), and there it really is the case that it's coded with 0 = "single" and 1 = "married".

    I can't quite figure out yet how to find the underlying information about the label assignment.

    If I use "label list" as from your link, I get the information about all labels from every variable from the original data set (which was huge). But if I use "label list state" or "label list gender", I get a 'value label state not found'.

    – Tototulbi Feb 23 '20 at 10:28
  • I guess that this means that there is no label list attached to state and gender. As shown in the link attached to my answer, you could attach a label list to your variables (e.g. label define gender 0 "male" 1 "female") but as far as I'm aware you'll need to use the underlying byte in your code (see the code example in my answer) – Joe Feb 23 '20 at 10:35
  • @Joe gives helpful advice but storage as a byte is not the explanation here: you would get the same problem if the variables concerned were any other numeric type. It's the misreading of value labels as if they were string values that is (correctly) explained as the problem. – Nick Cox Feb 23 '20 at 12:08
  • https://www.stata.com/manuals/u13.pdf explains a way to specify value labels in expressions (see 13.11 there). Note that label list state, as Stata is telling you, looks for a value label set named state; that value label name doesn't have to be identical to the name of the variable. But describe state will tell you the name of the value label associated with a variable, if there is one. – Nick Cox Feb 23 '20 at 12:14
  • Thanks a lot, Nick! This was a great explanation. By using the "value label" instead of the "variable label" in list label xy I am now able to identify all the labels of any variable! – Tototulbi Feb 23 '20 at 12:32
  • Good, but you mean label list xy as list is a quite different command. – Nick Cox Feb 23 '20 at 12:56
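
The point from the comments, that the value label name need not match the variable name, can be sketched as follows. This is only an illustration under assumptions: the toy values and the label name st_lbl are made up, and the last step shows decode as one way to get a string variable carrying the labels, in contrast to string(), which only converts the numeric codes.

```stata
* Toy data: the label name st_lbl is hypothetical and deliberately
* different from the variable name state
clear
input byte state
1
2
3
end
label define st_lbl 1 "Alabama" 2 "Alaska" 3 "Arizona"
label values state st_lbl

* describe reports the attached value label name in its "value label" column
describe state

* label list expects the value label name, not the variable name
label list st_lbl

* The label name can also be retrieved into a local macro
local lbl : value label state
display "`lbl'"

* decode builds a string variable from the labels, whereas
* string(state) would only yield the codes as text ("1", "2", "3")
decode state, generate(state_str)
list state state_str
```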