I'm working on a project that uses logistic regression to predict student retention. The data were collected through three self-report instruments. We want to find out which predictors are strong enough to identify at-risk students. I came across some articles saying that a balanced sample (50% stayers, 50% dropouts) is desirable for such a study, e.g. Glynn, J.G., Sauer, P.L., & Miller, T.E. (2003). Signaling Student Retention With Prematriculation Data. NASPA Journal, 41(1), 41-67:
"A problem however, is that the distribution of the dependent variable is likely to be highly skewed toward persistence. For example, if 85% of the analysis sample were persistors, a classification model that classified every student as a persistor would have a success rate of 85%, or would classify 85% of student correctly. To resolve this issue, the maintenance of relative balance between the number of dropouts and the number of persistors (about 50% each) in the analysis sample was desirable."
Is this true? Only about 25%-30% of the students in our sample are dropouts. Will this imbalance affect the results?
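
To make the quoted concern concrete, here is a minimal sketch of the arithmetic as I understand it. The counts are made up for illustration (850/150 to mirror the 85% example in the quote, and 700/300 as a stand-in for a sample like ours with roughly 30% dropouts), and the function name is my own, not something from the paper.

```python
def evaluate_all_persist(n_persist, n_dropout):
    """Score a trivial 'model' that labels every student as a persister.

    Returns (accuracy, dropout_recall). Labels: 0 = persister, 1 = dropout.
    """
    y_true = [0] * n_persist + [1] * n_dropout
    y_pred = [0] * len(y_true)  # always predict "persister"

    # Share of all students classified correctly.
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    # Share of actual dropouts that were flagged as dropouts.
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    dropout_recall = caught / n_dropout

    return accuracy, dropout_recall


# Hypothetical counts, chosen only to match the percentages discussed above.
quoted_example = evaluate_all_persist(n_persist=850, n_dropout=150)
our_sample = evaluate_all_persist(n_persist=700, n_dropout=300)

print("85/15 split:", quoted_example)
print("70/30 split:", our_sample)
```

In both cases the overall accuracy equals the share of persisters, while none of the dropouts are identified, which seems to be the issue the authors are pointing at. What I can't tell is whether this only matters for how the classification rate is judged or whether it also affects the estimated coefficients.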