
I am a beginner in causal inference and am seeking guidance on my research plan, which aims to uncover the causal effect of variable X (treatment) on variable Y (outcome). Here, X represents the actual action that results from decisions made within a household.

To achieve the least biased estimate, I propose the following strategy:

  1. Identifying Influential Variables: Enumerate variables other than X that may impact Y (addressing omitted-variable bias.)

  2. Identifying Control Variables for Household Characteristics: List control variables that account for household characteristics, addressing sample selection bias. (So far, I have compiled all related observable variables, as far as I know.)

  3. Directed Acyclic Graph (DAG) Analysis: Use a DAG to identify which variables should be controlled for to enhance the validity of the causal inference.

  4. Addressing Unobserved Heterogeneity: Propose methods to mitigate biases stemming from unobserved heterogeneity and simultaneous equation bias by applying (1) an instrumental variable, (2) sensitivity analysis, and (3) observed variables that serve as proxies for unobserved heterogeneity. I hope (1) will succeed; (2) and (3) are the backup plan if (1) fails.

  5. Fit the model.

I welcome any feedback or suggestions on my research approach, particularly regarding its effectiveness in yielding accurate and reliable causal inferences in the context of development economics or agricultural economics.

DrJerryTAO
  • 1,514
Sho
  • 23
  • Please provide the data structure, unit of analysis, and other details of the experimental design. – DrJerryTAO Jan 20 '24 at 06:32
  • Hello Dr. TAO. I am using cross-sectional data. The unit is household i. X is the volume of material input for agriculture. Y is a dummy variable that takes 1 when at least one child in household i works in farming. X could increase labor demand, which could lead to child labor. However, household i probably decides the volume X according to expected labor demand so as not to make its children work. This is why I think simultaneous equation bias could occur. Since household characteristics such as income level could also affect child labor, I have to control for these effects. – Sho Jan 20 '24 at 06:57
  • See my answer. Could you elaborate on what "the volume of material input" means? Is it a single continuous variable or a collection of many variables? In economics, many production factors are monetized, such as the present nominal value of all materials. – DrJerryTAO Jan 20 '24 at 09:03
  • Thank you so much Dr. TAO. I appreciate your introduction to causal inference.

    First of all, I would like to clarify what I can answer now. X is the volume of fertilizer (kg/ha) used for farming in household i. It may be the sum of N, P, and K, so it is unclear whether it is a single continuous variable or a collection of three continuous variables. In any case, Y is binary (0/1), as I mentioned in the first post. What I mean by "simultaneous equation bias" is that Y affects X just as X affects Y.

    (continued)

    – Sho Jan 20 '24 at 09:29
  • Then, what I thought was as follows. “Even if I try to include as many observable variables as possible in my model, unobservable variables cannot be included. Therefore, this approach (adding as many related variables as possible) can reduce the omitted variable bias, but it cannot reduce it to zero. To address this “remaining” bias stemming from omitted variables (in other words, unobserved heterogeneity) as well as simultaneity, I have to combine the instrumental variable method with adding as many variables as possible.”

    (continued)

    – Sho Jan 20 '24 at 09:29
  • However, according to your advice, if I can find an appropriate Z (instrumental variable), adding many variables may no longer be necessary, or could even be harmful, even if these variables are related to household characteristics. Do I understand your point correctly? – Sho Jan 20 '24 at 09:30
  • I am glad that you find my advice useful. In linear models, an omitted variable results in bias only if it is correlated with X. As you have found out, linear regression models always have an error term E. Adding more predictors reduces the variance of E, which generally shrinks the standard errors of the coefficients. In causal inference, however, the task is not to minimize E but to ensure that E is not correlated with X, so that the coefficient of X is unbiased. Using outcomes of X other than Y as predictors (as if using a proxy of Y to predict Y) is harmful for causal inference, although it likely increases R². – DrJerryTAO Jan 20 '24 at 10:04
  • Your advice cleared things up for me. Thank you so much, Dr. TAO. – Sho Jan 20 '24 at 11:31
  • @Sho, you should consider accepting DrJerryTAO's answer by clicking on the tick symbol alongside the post. – User1865345 Jan 20 '24 at 14:04
  • This https://math.stackexchange.com/help/someone-answers shows how to accept an answer. Is X the amount a land assessor recommends or the amount a household actually used? This determines whether X is exogenous or endogenous. Three endogenous variables require at least three instrumental variables. It might be okay to use the summed weight of the three fertilizers. Crop choice and land condition affect fertilizer requirements. If X is measured per hectare, total land size also matters for labor demand. However, I cannot see how predicting child labor from fertilizer usage gives useful implications. Farming technology does. – DrJerryTAO Jan 20 '24 at 14:49
  • The use of modern inputs (N, P, K) reflects one aspect of agricultural modernization. In the process of agricultural modernization, not all farmers can stay on their land. In other words, some farmers will have to leave the agricultural sector in the future. My ultimate question is how we can foster “sustainable out-migration” from agriculture in parallel with agricultural modernization. Child labor is one of the constraints on sustainable out-migration because it deprives children in farm households of their education, which leads to poverty in the informal sector or urban slums. – Sho Jan 21 '24 at 07:28
  • What do you mean by "land assessor recommends"? X is the total amount of chemical fertilizer, which was collected through in-person interviews. I have already confirmed that the amount of fertilizer (kg) affects labor demand (person*hours), controlling for land size (ha). As you mentioned, there is an endogeneity problem between X and Y because a household can decide the input amount based on expected labor demand and the need to make its children work. The biggest difficulty is finding an appropriate instrumental variable. Variables related to household characteristics are not useful. – Sho Jan 21 '24 at 07:58

1 Answer


Since you are a beginner in causal inference, you will find the following tutorials on basics in causal inference useful.

Without further details on the experimental design that prescribes the data structure, unit of analysis, and data generating process, it is very difficult to determine the accuracy and reliability of your research plan. In the context of agricultural economics, you need to address the following questions:

  • About X: Is it continuous or categorical? How is its value determined? Are the household decisions that result in X affected by any events outside the household's control? How are these external events distributed? These affect how you should adjust the treatment.
  • About Y: Is it continuous or categorical? Since reverse causality is a concern, is there an instrumental variable that affects X but not Y directly? Is there a temporal or spatial lag so that current Y cannot affect X in the past or in neighboring households? These affect the model type and residual corrections.
  • About the experiment: Do you have data on individual households or on their averages by region and time? Do you have cross-sectional or longitudinal data on households? Are there clusters or time series that render individual observations correlated?

In labor economics, we usually have income as the outcome variable. We can only observe the income of those who work and cannot observe the potential income of those who do not have jobs. This is the central problem of women's labor force participation that motivates Heckman's sample selection model. See Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement, 5(4), 475–492. https://cir.nii.ac.jp/crid/1573668925609148544 and Bushway, S., Johnson, B. D., & Slocum, L. A. (2007). Is the magic still there? The use of the Heckman two-step correction for selection bias in criminology. Journal of Quantitative Criminology, 23(2), 151–178. https://doi.org/10.1007/s10940-007-9024-4. By analogy with women's labor force participation, I speculate that children participate in labor if their current contribution to household income exceeds the present value of the difference in their expected future income with and without proper schooling. On the other hand, production output is a function of labor, material, and land inputs. Land is usually constant within a rural family. When labor is the limiting factor, material no longer determines the output. For a farming family, income is the output minus costs, which include the cost of materials and outsourced labor. The cost of using family members is usually internalized.

I fail to see why it is important to predict child labor from material input. Is the potential policy implication that we need to cut agricultural material quantity or raise its price to reduce child labor? In any case, it is theoretically a bad idea to control for household income in your current design, since child labor increases income. Further, there are a few immediate concerns in your current plan:

  1. Enumerating variables other than X to find those that may impact Y is useful in exploratory data analysis but can be dangerous in confirmatory analysis. If everything happens at random, then out of 20 variables you screen, one is expected to show p < .05 and would be flagged as an influential variable. See Brodeur, A., Cook, N., & Heyes, A. (2020). Methods Matter: P-Hacking and Publication Bias in Causal Analysis in Economics. American Economic Review, 110(11), 3634–3660. https://doi.org/10.1257/aer.20190687. A better method is a literature review to determine these other variables.

  2. Adding more variables to a Y ~ X equation may not resolve the sample-selection bias. The sample-selection issue has to be addressed with techniques similar to the instrumental variable approach: you need to estimate another equation that explains how X is generated, usually including an instrumental variable as one of the predictors. Techniques such as regression adjustment, inverse-probability weights, inverse-probability-weighted regression adjustment, augmented inverse-probability weights, and matching on the propensity score may be helpful. See Toomet, O., & Henningsen, A. (2008). Sample selection models in R: Package sampleSelection. Journal of Statistical Software, 27(7). https://doi.org/10.18637/jss.v027.i07 and Stata. (2014). Introduction to treatment effects for observational data. https://www.stata.com/manuals13/teteffectsintro.pdf. If X affects not only Y but also some other variables W, however, W should not be used as a predictor of Y alongside X.

  3. Sketching simple causal links between variables helped me specify the model formula. A formal directed acyclic graph, in contrast, made the process too complex for me. I have read a few papers on graphical causal analysis and found that they repeat what I learned from mathematical equations. Unless required, I do not think that you have to supply formal diagrams to justify the model specification.

  4. It is unclear what you refer to as "unobserved heterogeneity" and "simultaneous equation bias". An instrumental variable Z removes the bias in the causal estimate from Y ~ X, but you need to justify conceptually that Z affects Y only through X and is unrelated to the error term. Identifying a useful instrumental variable is the most valuable contribution of many causal inference studies. Sensitivity analysis assesses whether the inference is sensitive to the model specification. You need to acquire data from different sources, separate the data into different groups, or estimate different models from the same data to see whether the direction and magnitude of the effects remain the same.

  5. Before fitting a model, one must first address all the above questions. In causal inference, the objective is not to fit a perfect model that predicts the outcome with the least error. Instead, it is to retrieve the causal impact of X on Y, which is usually represented by one coefficient. That means one may need to remove certain predictors of Y that X also affects. This sacrifices goodness of fit in order to reduce the bias in the coefficient of interest. For a paper that includes many of the techniques mentioned above, see Ma, L., Montgomery, A. L., Singh, P. V., & Smith, M. D. (2014). An empirical analysis of the impact of pre-release movie piracy on box office revenue. Information Systems Research, 25(3), 590–603. https://doi.org/10.1287/isre.2014.0530.
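
To make point 4 concrete, here is a minimal simulation sketch in Python, not an analysis of your data. Everything in it is an assumption made for illustration: the true effect `beta_true`, the unobserved confounder `u`, the instrument `z` (which by construction affects Y only through X), and a continuous Y in place of your binary one. It compares naive OLS with a hand-rolled two-stage least squares estimate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000
beta_true = 0.5  # assumed true causal effect of X on Y

u = rng.normal(size=n)  # unobserved heterogeneity, affects both X and Y
z = rng.normal(size=n)  # hypothetical instrument: shifts X, no direct path to Y
x = 1.0 * z + 1.0 * u + rng.normal(size=n)
y = beta_true * x + 1.0 * u + rng.normal(size=n)


def ols(design, target):
    """Least-squares coefficients for target ~ design (design includes a constant)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


const = np.ones(n)

# Naive OLS of Y on X: the slope absorbs part of the effect of the omitted U.
ols_slope = ols(np.column_stack([const, x]), y)[1]

# Two-stage least squares by hand.
# Stage 1: predict X from the instrument Z.
first_stage = np.column_stack([const, z])
x_hat = first_stage @ ols(first_stage, x)
# Stage 2: regress Y on the predicted X.
iv_slope = ols(np.column_stack([const, x_hat]), y)[1]

print(f"true effect : {beta_true:.3f}")
print(f"OLS slope   : {ols_slope:.3f}")
print(f"2SLS slope  : {iv_slope:.3f}")
```

In this setup the OLS slope should land well above the true effect while the 2SLS slope should be close to it. Note that the standard errors from a manual second stage are not valid, so dedicated IV routines are preferable in real work, and the result depends entirely on the exclusion restriction built into the simulation, which in practice has to be argued rather than tested.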

If the outcome variable is binary, we also need to understand that:

  • The coefficients are intrinsically standardized by an unknown factor that never stays the same across different samples, groups, and model specifications. See Williams, R., & Jorgensen, A. (2023). Comparing logit & probit coefficients between nested models. Social Science Research, 109, 102802. https://doi.org/10.1016/j.ssresearch.2022.102802.
  • Some simulation papers show that in binary regression models, omitting variables biases the coefficient whether those variables are correlated with or independent of the treatment. See Keele, L., & Park, D. K. (2006, March 3). Difficult choices: An evaluation of heterogenous choice models. Meeting of the American Political Science Association, Chicago, IL. My simple demonstration shows that this issue is only important if the objective is to retrieve the absolute value of the original coefficients on the latent response variable in the ideally imagined data generating process. In practice, however, we can only obtain mysteriously standardized coefficients. Instead of regression coefficients, a much more useful effect-size measure in binary regression is the average marginal effect: the change in the probability of Y = 1 upon a unit change in X. Omitting variables independent of X, although it changes the standardized coefficients of X, will not bias the average marginal effects of X on the outcome probability.
  • There is also a heteroscedasticity issue in binary regression. See Williams, R. (2010). Fitting heterogeneous choice models with oglm. The Stata Journal, 10(4), 540–567. https://doi.org/10.1177/1536867X1101000402. In essence, fitting the scale equation allows a complex but constrained interaction between all predictors. However, no modeling technique can differentiate heteroscedasticity from a wrongly specified location equation. See a demonstration at https://www.r-bloggers.com/2013/02/the-problem-with-testing-for-heteroskedasticity-in-probit-models/.
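
The point about average marginal effects can be illustrated with a small simulation. This is a minimal sketch under assumed values, separate from the demonstration referenced above: the latent coefficients (1.0 on `x`, 2.0 on `v`), the independence of `x` and `v`, and the hand-written Newton-Raphson logit fitter `fit_logit` are all choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40_000
x = rng.normal(size=n)  # treatment
v = rng.normal(size=n)  # covariate that affects Y but is independent of x
latent = 1.0 * x + 2.0 * v + rng.logistic(size=n)  # assumed latent response
y = (latent > 0).astype(float)


def fit_logit(design, target, iters=30):
    """Logit maximum likelihood via Newton-Raphson (illustrative helper)."""
    coef = np.zeros(design.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(design @ coef)))
        gradient = design.T @ (target - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        coef = coef + np.linalg.solve(hessian, gradient)
    return coef


def average_marginal_effect(design, coef, column):
    """Mean over the sample of dP(Y=1)/d(column) implied by the fitted logit."""
    p = 1.0 / (1.0 + np.exp(-(design @ coef)))
    return coef[column] * np.mean(p * (1.0 - p))


const = np.ones(n)
full = np.column_stack([const, x, v])  # includes v
short = np.column_stack([const, x])    # omits v

coef_full = fit_logit(full, y)
coef_short = fit_logit(short, y)
ame_full = average_marginal_effect(full, coef_full, 1)
ame_short = average_marginal_effect(short, coef_short, 1)

print(f"coefficient of x: full={coef_full[1]:.3f}, short={coef_short[1]:.3f}")
print(f"AME of x        : full={ame_full:.3f}, short={ame_short:.3f}")
```

Under these assumptions the coefficient of `x` should shrink noticeably when `v` is omitted, while the two average marginal effects should stay close to each other. The agreement is approximate rather than exact, because the model without `v` is no longer exactly a logit.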
DrJerryTAO
  • 1,514