
I'm analysing data from an electronic health record and determining which variables to include in a model to close back-door paths and avoid bias.

I've read that it is important to have a subject-specific expert determine the variables on theoretical grounds. Even so, there may be spurious colliders and confounders. I was thinking of some hypothetical situations and was wondering whether there is any guidance on them.

  1. Should a variable be included if it is theoretically justified, but has a small effect size and is statistically insignificant? Say age were the variable: it feels like it would be controversial to exclude such a variable, given how ubiquitous it is among existing models.

  2. If the theoretical DAG indicates a variable is a collider, but the data suggest it is a confounder, should it still be controlled for?

  3. The reverse of 2: if the data suggest a theoretical confounder is actually a collider, should it be left unadjusted?

I feel that in this case I should still report the theoretical DAG, but note the deviations and their justifications, and then follow the data to obtain an unbiased estimate; although, what I've been reading is making me question this.
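To make 2 and 3 concrete, here is a toy simulation I put together (all variable names and effect sizes are made up; the true treatment effect is 1.0 throughout) showing what happens when a confounder is left out versus when a collider is adjusted for:

```python
# Toy simulation for cases 2 and 3 (made-up variables; the true treatment effect is 1.0).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 50_000

# Case 3: "conf" is a confounder (conf -> t, conf -> y). Omitting it biases the estimate.
conf = rng.normal(size=n)
t = 0.8 * conf + rng.normal(size=n)
y = 1.0 * t + 0.8 * conf + rng.normal(size=n)
df = pd.DataFrame(dict(t=t, y=y, conf=conf))
print("confounder, unadjusted:", round(smf.ols("y ~ t", data=df).fit().params["t"], 2))        # biased upward
print("confounder, adjusted:  ", round(smf.ols("y ~ t + conf", data=df).fit().params["t"], 2)) # close to 1.0

# Case 2: "coll" is a collider (t -> coll <- y). Adjusting for it opens a biasing path.
t = rng.normal(size=n)
y = 1.0 * t + rng.normal(size=n)
coll = 0.8 * t + 0.8 * y + rng.normal(size=n)
df = pd.DataFrame(dict(t=t, y=y, coll=coll))
print("collider, unadjusted:", round(smf.ols("y ~ t", data=df).fit().params["t"], 2))          # close to 1.0
print("collider, adjusted:  ", round(smf.ols("y ~ t + coll", data=df).fit().params["t"], 2))   # biased downward
```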

Geoff
  • What do you mean mathematically by "the data suggests [a variable] is a confounder/collider"? Is this an algorithmic DAG from PC-style causal discovery or from a score/optimization-based method? – chang_trenton Aug 31 '23 at 16:18
  • Also, philosophically, causal inference implies a different way of thinking about data than "standard" (loosely defined) predictive modeling. One way to frame a causal DAG is an encoding of one's/a domain expert's belief in the state of the world, including counterfactuals. Without causal assumptions, observational data only reflects "what did happen." Thus, I strongly prefer the usage of subject-specific expertise to design DAGs, even if a "ground truth" DAG is only knowable under omnipotence. – chang_trenton Aug 31 '23 at 16:23
  • So, I'm still new to this, but I was referring to the PC algorithm, which in pcalg in R measures association with Pearson correlation. So if this algorithm showed confounders where the theory suggests colliders, and vice versa, should I go against that expert knowledge, since, despite the theory, if there are empirical links, back doors will still be left open? – Geoff Aug 31 '23 at 16:28
  • Regarding the philosophical point, I assumed that, while there is knowledge to suggest counterfactuals, there can still be associations that show the opposite, which leave back doors open. For the sake of an unbiased estimate, even if the associations are spurious, they still need handling, despite what theory suggests. – Geoff Aug 31 '23 at 16:32

2 Answers


This is going to be a regrettably subjective answer, but I think the question is reasonable.

The short answer is that there is no clear rule for what to do. At the end of the day, for the purposes of conducting causal inference, the modeler (you) has to make a judgment call on whether the results of causal discovery (in the case of PC, the implied conditional independencies) on the observed data overrides existing expert knowledge in your domain.

To make this more concrete, let's say in one sub-step of PC, for variables $X, Y, Z$, you eliminate an edge between $X, Y$ because $X \perp Y \mid Z$ on your sample, but this contradicts expert knowledge (this is a simpler case than collider vs. confounder, but I think it illustrates the point). Is $X \perp Y \mid Z$ true in general (PC was right)? Or is it a quirk of your sample? How would you answer this question?
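For concreteness, here is a minimal sketch (Python, made-up data) of the kind of decision behind that sub-step; it is not your pipeline, just a Fisher-z test of whether the partial correlation of $X$ and $Y$ given $Z$ is zero, which is roughly what a Gaussian conditional-independence test inside a PC implementation does:

```python
# A minimal sketch (made-up data, illustrative names) of a linear/Gaussian CI test:
# a Fisher-z test of whether the partial correlation of X and Y given Z is zero.
import numpy as np
from scipy import stats

def fisher_z_ci_test(x, y, z, alpha=0.05):
    """Return (p_value, independent?) for the hypothesis X ⊥ Y | Z."""
    n = len(x)
    Z = np.column_stack([np.ones(n), z])                 # conditioning set plus intercept
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]    # residualize X on Z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]    # residualize Y on Z
    r = np.corrcoef(rx, ry)[0, 1]                        # partial correlation
    k = Z.shape[1] - 1                                   # size of the conditioning set
    z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    return p_value, p_value > alpha                      # "independent" -> PC drops the X--Y edge

# Toy data in which X and Y are associated only through Z, so the edge should go:
rng = np.random.default_rng(0)
z = rng.normal(size=500)
x = 2.0 * z + rng.normal(size=500)
y = -1.5 * z + rng.normal(size=500)
p, drop_edge = fisher_z_ci_test(x, y, z)
print(f"p-value = {p:.3f}, drop the X--Y edge: {drop_edge}")
```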

I'd be very skeptical of any blanket rules of thumb for which one to prefer, and don't have much to offer here except "think carefully" and "consult with domain experts." I'm not aware of any, but it's possible that my knowledge is incomplete here.

I'm not sure how (without expert knowledge) one can algorithmically distinguish bias from your particular sample (i.e., bad finite-sample luck) from bias due to a misspecified DAG (i.e., incorrect assumptions about the world), because (philosophically) the DAG itself encodes assumptions. That is, at some point, to perform causal inference, one needs to choose a set of assumptions about the world. In my opinion, this is one of the distinguishing factors between causal inference and predictive modeling (e.g., supervised machine learning).

I think this touches on your main questions (i.e., what do I include? Do I treat X as a confounder vs. collider?), but let me know if I'm missing something.

In practice, it's probably impossible to control for literally all confounders -- you're unlikely to eliminate all unobserved confounding/collider bias. This is no reason to panic, because by choosing a DAG, you make transparent the assumptions made about the data-generation process. If you choose to believe the DAG generated by PC over the DAG generated by expert knowledge (or vice versa), then under each DAG you can check the requisite backdoor criteria for building a model for whatever causal estimand you care about. I don't think it hurts to fit a model under each DAG as a robustness check, but questions of analytic strategy are probably best left to your collaborators.
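As a sketch of that robustness check (the variable names and both adjustment sets below are hypothetical, standing in for whatever the backdoor criterion yields under the expert DAG and under the PC DAG), one could simply report the treatment coefficient under each implied adjustment set:

```python
# Hedged sketch of "fit a model under each DAG": compare the treatment estimate
# under the adjustment set implied by the expert DAG vs. the one implied by PC.
# All variable names, effect sizes, and adjustment sets here are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000
age = rng.normal(60, 10, n)
comorbidity = rng.binomial(1, 0.3, n)
treatment = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * (age - 60) + 0.7 * comorbidity))))
outcome = 1.0 * treatment + 0.05 * age + 0.8 * comorbidity + rng.normal(0, 1, n)
df = pd.DataFrame(dict(age=age, comorbidity=comorbidity, treatment=treatment, outcome=outcome))

# Suppose the expert DAG implies adjusting for {age, comorbidity},
# while the PC output implies adjusting for {age} only.
expert_fit = smf.ols("outcome ~ treatment + age + comorbidity", data=df).fit()
pc_fit = smf.ols("outcome ~ treatment + age", data=df).fit()

print("estimate under expert-DAG adjustment:", round(expert_fit.params["treatment"], 2))
print("estimate under PC-DAG adjustment:    ", round(pc_fit.params["treatment"], 2))
# Large disagreement between the two is a sign that which DAG you believe matters.
```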

  • This is a great answer, thanks for this! I think we may be on the same page, but let me know if not. Referring to the case where the evidence suggests a confounder and the expert suggests a collider, my question was: regardless of what is known beforehand, if there is a confounder then the zero conditional mean assumption is violated, so to yield unbiased estimates shouldn't we disregard the expert opinion here? – Geoff Sep 01 '23 at 09:27
  • So, my general point is, I'm not sure why theoretical knowledge should determine variable selection when the goal is unbiased estimators. I suppose an analogous point would be fitting robust standard errors in the presence of heteroscedasticity when theory or prior research suggests otherwise; whatever that prior information is, why would it be beneficial to leave the standard errors biased and the estimators inefficient? – Geoff Sep 01 '23 at 09:32
  • My issue is the same, because we end up using a finite sample to make a claim about unbiasedness. By definition, unbiasedness and efficiency depend on fixed properties (from a frequentist POV) of the population, so the argument for disregarding "expert opinion" only holds in the limit of infinite data. Given that PC relies on statistical testing, without, say, sample size information for starters, I'm very skeptical of discarding expert opinion. Happy to discuss more in chat. – chang_trenton Sep 01 '23 at 16:39
  • I wouldn't call myself a causal discovery expert, so I'll defer to Ch. 25-6 of this explanation, which more effectively makes my case: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch25.pdf – chang_trenton Sep 01 '23 at 16:40

Note of caution when applying causal discovery

This is clearly somewhat subjective and case-dependent, but let me caution you against putting too much trust in the results of causal discovery methods. This may seem discouraging, but given how gigantic and consequential the task of automated causal discovery really is, it should not come as a surprise. Having worked on causal discovery benchmarking, I'd recommend treating the results of any causal discovery method with great caution. The field is not at a point where the methods can be trusted in real-world settings (mostly because the assumptions are not realistic). Applied well, they may give an indication where there is no domain knowledge, but they certainly do not come close to actual domain expertise and may well do more harm than good.

It is not easy to provide evidence, since negative results don't tend to get published as much, so let me ask instead:

  1. Do your data match the model class of the causal discovery algorithms?
  2. Are there credible examples of successful and useful applications of causal discovery in your domain?

The answers to these questions should give you a useful prior on how much trust to put in your algorithmically discovered causal structures.
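As a rough illustration of question 1 (everything below is a hypothetical sketch, not a validated diagnostic), you can at least tabulate which columns of your extract are discrete, skewed, or heavily missing before handing them to a linear-Gaussian conditional independence test:

```python
# Rough sketch of the kind of look question 1 suggests taking before trusting
# PC with linear/Gaussian CI tests; the data frame and its columns are made up.
import numpy as np
import pandas as pd
from scipy import stats

def model_class_report(df: pd.DataFrame) -> pd.DataFrame:
    """Flag columns that sit poorly with a linear-Gaussian model class."""
    rows = []
    for col in df.columns:
        x = df[col].dropna()
        discrete = (not np.issubdtype(x.dtype, np.floating)) or x.nunique() < 10
        rows.append({
            "variable": col,
            "n_unique": x.nunique(),
            "discrete_or_coded": discrete,                                  # the test assumes continuous data
            "skewness": float(stats.skew(x)) if not discrete else np.nan,   # far from Gaussian?
            "missing_frac": float(df[col].isna().mean()),
        })
    return pd.DataFrame(rows)

# Mixed data types, as is typical for EHR extracts:
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "age": rng.normal(60, 10, 500),
    "lab_value": rng.lognormal(1.0, 0.8, 500),    # heavily skewed
    "diagnosis_flag": rng.binomial(1, 0.2, 500),  # binary
})
print(model_class_report(demo))
```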

Scriddie
  • In the question I was just posing hypothetical situations; I haven't used any of the algorithms on data yet. If the PC algorithm detects a confounder, why wouldn't it be useful to control for it? – Geoff Sep 04 '23 at 13:12
  • My point is that the output of the PC algorithm may itself be questionable. For example, most implementations rely on linear conditional independence tests, which implies linear functional relationships. I'm also not sure the implementations would necessarily work for mixed data types (e.g. continuous and discrete variables). On top of that, there may be finite sample issues, outliers and data errors, and so on. So even just running PC (or anything else) on a moderately complex real-world data set is a challenge, and I'd be very careful about reading too much into the results. – Scriddie Sep 04 '23 at 18:26
  • I see what you're saying. I think from this I'll maybe start with the PC DAG and then let the subject expert overrule it. With confounding, though, if it were a result of outliers or data errors, wouldn't I still want to control for it to partial its effect out? – Geoff Sep 05 '23 at 09:19
  • If you do have a subject matter expert at hand, giving them the last say sounds like a very good idea. I'd say reliable real-world causal discovery is a bit like nuclear fusion - big potential and some promising signs, but it's just not there yet.

    There is a cost to controlling for variables if they open backdoor paths instead of closing them. So with an unreliable method (read: any causal discovery method at this point in time), you are running some danger of making such a mistake. Not including a confounder would also be a problem, of course - so there isn't an easy answer unfortunately.

    – Scriddie Sep 05 '23 at 14:26