I have a longitudinal dataset and I am trying to create two variables that correspond to two time periods based on specific date ranges (pre- and post-), so that I can analyze the effect of each time period on my outcome. However, when I tried to use these time period variables as exposures for my outcome in a logistic regression, I got this error message: "outcome = variable_name > 0 predicts data perfectly."

This is how I created my time period variables:

**Period 1:** bysort id_variable: gen pre-period = binary_outcome if (date_of_eval < mdy(5,4,2020))
**Period 2:** bysort id_variable: gen post-period = binary_outcome if (date_of_eval > mdy(5,4,2020))

Does anyone know what this error code means or if I have incorrectly coded my variables?

Thank you!

1 Answer

There is a statistical question deep down here, which is about constant outcomes (or predictors), but it is hard to answer as you don't give the syntax for the Stata command that didn't work. This would be a poor question for Stack Overflow for that and other reasons. You'd be better off on Statalist, but I will answer so far as I can, noting that there is too much to say to fit easily in comments.

Note that

  1. bysort id_variable: has no bearing whatsoever on the outcome of your calculations. It does no harm, but you'd get the same results without it.

  2. post-period and pre-period are illegal Stata variable names because they contain hyphens, so presumably you used something else, perhaps underscores.

The biggest deal of all is that pre_period will be missing whenever post_period is not missing, and vice versa, because each is defined only on its own side of the date cutoff. So at best either would be a constant outcome if given as an outcome, or a constant predictor if given as a predictor, which may underlie your difficulties. Either way round, observations with missing values on any variable mentioned in the command would be ignored.
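To see the missingness pattern for yourself, here is a minimal sketch on toy data. The dataset, the values, and the underscore names pre_period and post_period are assumptions made up for illustration, since the question does not show the real data or the names actually used.

```stata
* Toy data: every name and value here is hypothetical
clear
input id str10 date_str binary_outcome
1 "01mar2020" 0
1 "01jun2020" 1
2 "15apr2020" 1
2 "04may2020" 0
2 "20jul2020" 0
end

* Convert the string dates to Stata daily dates
gen date_of_eval = daily(date_str, "DMY")
format date_of_eval %td

* Same definitions as in the question, with legal variable names
gen pre_period  = binary_outcome if date_of_eval < mdy(5,4,2020)
gen post_period = binary_outcome if date_of_eval > mdy(5,4,2020)

* Count observations where both variables are non-missing
count if !missing(pre_period) & !missing(post_period)

* Inspect the pattern row by row
list id date_of_eval pre_period post_period, sepby(id)
```

The count should be 0, because no observation can fall on both sides of the cutoff. The listing should also show that the observation dated exactly 4 May 2020 is missing on both variables, since neither the < nor the > condition includes that date.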

My guess is that a single indicator variable

gen when = date_of_eval > mdy(5,4,2020)

is all that you need. It will have value 1 when true and value 0 when false.
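One caveat worth checking: Stata treats numeric missing as larger than any non-missing number, so an observation with a missing date_of_eval would get when equal to 1 from the command above. The following is a sketch of a guarded variant, followed by one possible model. The logit call is an assumption, as the original command was not shown, and clustering on id_variable is only one of several ways to acknowledge repeated measures per person.

```stata
* Guarded variant: leave when missing if the date itself is missing
gen when = date_of_eval > mdy(5,4,2020) if !missing(date_of_eval)

* Check the split against the outcome before modelling
tabulate when binary_outcome, missing

* Hypothetical model: outcome on the period indicator,
* with standard errors clustered on the person identifier
logit binary_outcome i.when, vce(cluster id_variable)
```

With this coding, an evaluation dated exactly 4 May 2020 falls in the 0 group; use >= instead of > if that date belongs to the later period. If the tabulation shows an empty cell, the "predicts data perfectly" message is to be expected from the data rather than from a coding error.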

Nick Cox
  • Hi Nick, thank you for your help! I think this simpler code may work :) – user15663873 Aug 28 '22 at 19:39
  • Hi again Nick, thanks for the help. My Stata code for grouping variables by ID is `gen period_1 = date_eval < mdy(5,4,2020)` `preserve` `collapse period_1=period_1` `count if period_1` and it gives me a number of individuals during that period. However, I get a different number if I use the SQL query in Python `evals_period_1 = ps.sqldf('SELECT id, COUNT(date_eval) FROM df WHERE strftime(date_eval) < strftime("%m/%d/%Y",{}) GROUP BY id'.format('5/4/2020'))`. Am I grouping by ID differently in these two codes? Please let me know what you think. Thank you!! – user15663873 Aug 29 '22 at 07:57
  • Please start a new thread on Stack Overflow following the guidelines there about minimal reproducible examples. I can’t help with Python queries on any forum. – Nick Cox Aug 29 '22 at 08:08