Yunyan Duan

Data Science | Computational Linguistics | Cognitive Science

Resources for learning causal inference

Causal inference is a family of statistical methods for estimating the causal effect of a factor on an outcome. While common statistical tools (e.g. regression) describe correlations between variables, causal inference goes beyond correlation and tries to estimate the causal influence of one variable on another. What ‘causal’ actually means can be a difficult philosophical question, but a working definition is enough here: variable A (called an ‘intervention’ or a ‘treatment’) has a causal effect on variable B (called an ‘outcome’) if B would have been different without A. The counterfactual is thus a useful concept for thinking about causality. Nowadays, data scientists may need causal inference to estimate the effect of an intervention from observational data when an A/B test of that intervention is unavailable or too expensive.
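To make the counterfactual idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the confounder `z`, the treatment probabilities, the effect size of 2.0, and all function names); it is not taken from any of the packages listed below. Because the data are simulated, both potential outcomes are visible for every unit, which is never the case with real data.

```python
import random


def simulate(n=20000, true_effect=2.0, seed=0):
    """Simulate units with both potential outcomes (hypothetical setup)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                   # confounder
        a = rng.random() < (0.8 if z else 0.2)   # treatment is likelier when z is True
        y0 = 5.0 * z + rng.gauss(0, 1)           # outcome without treatment
        y1 = y0 + true_effect                    # outcome with treatment
        y = y1 if a else y0                      # only one of the two is observed
        rows.append({"z": z, "a": a, "y": y, "y0": y0, "y1": y1})
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def true_ate(rows):
    """Average treatment effect, computable only because y0 and y1 are both simulated."""
    return mean(r["y1"] - r["y0"] for r in rows)


def naive_difference(rows):
    """Difference in observed means between treated and untreated units."""
    treated = mean(r["y"] for r in rows if r["a"])
    control = mean(r["y"] for r in rows if not r["a"])
    return treated - control


rows = simulate()
print(f"true ATE:         {true_ate(rows):.2f}")
print(f"naive difference: {naive_difference(rows):.2f}")
```

In this setup the true effect is 2 by construction, while the naive comparison is about 5 in expectation, because treated units are also more likely to have the confounder that raises the outcome.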

A famous example of the pitfalls of looking only at correlations in the data is Simpson’s paradox. Since confounding factors are common in observational data, interpretations based on correlation alone can easily be wrong. This may not be a problem if the task is not interpretation: prediction may actually benefit from correlated features as long as the data and the model generalize. But if interpretation is the real concern, then one should look for the causal effect of the intervention on the outcome. Whenever possible, an A/B test should be carried out and conclusions should be based on the experimental data, as this approach is the gold standard for measuring an intervention’s effect. If an A/B test is not possible, as when studying the influence of a state-level policy or some economic phenomenon, then causal inference should be adopted instead of other commonly used statistical tools. Even so, one should always be cautious when drawing causal conclusions from observational data, as the assumptions behind a causal inference method may or may not be satisfied.
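Simpson’s paradox is easy to reproduce with a small table of counts. The sketch below uses illustrative numbers (patterned on the often-quoted kidney stone example, and not part of the original post): treatment A has the higher success rate within each stratum, yet treatment B looks better when the strata are pooled, because A was given mostly to the harder cases. The function names are hypothetical helpers.

```python
# Illustrative counts: (successes, total) per treatment and stratum.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_rates(treatment):
    """Success rate within each stratum for one treatment."""
    return {s: succ / tot for s, (succ, tot) in counts[treatment].items()}


def overall_rate(treatment):
    """Success rate after pooling all strata (ignores the confounder)."""
    succ = sum(c[0] for c in counts[treatment].values())
    tot = sum(c[1] for c in counts[treatment].values())
    return succ / tot


def adjusted_rate(treatment):
    """Weight each stratum's rate by that stratum's share of the whole population."""
    stratum_totals = {
        s: sum(counts[t][s][1] for t in counts) for s in counts[treatment]
    }
    n = sum(stratum_totals.values())
    rates = stratum_rates(treatment)
    return sum(rates[s] * stratum_totals[s] / n for s in stratum_totals)


for t in counts:
    print(t, stratum_rates(t), round(overall_rate(t), 3), round(adjusted_rate(t), 3))
```

The pooled comparison and the stratum-adjusted comparison point in opposite directions here; which one answers the causal question depends on whether the stratifying variable is really a confounder, which the data alone cannot tell you.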

Here is a list of resources for beginners who want to use causal inference in their work:

  • I highly recommend starting with this series of blog posts, which are very beginner-friendly and give an overview of causal inference.

  • For a high-level understanding of causal inference, I recommend The Book of Why by Judea Pearl. You may also check out the author’s page for more books/tutorials on related topics.

  • There are many powerful Python and R packages. Here are my picks:
    • CausalML, a Python package from Uber. Easy to use, with many causal inference methods.
    • EconML and DoWhy, two Python packages from Microsoft. Great documentation and theoretical explanations.
    • grf, an R package that implements a tree-based algorithm called ‘causal forest’. See these papers for more details (paper1, paper2).
  • The above-mentioned packages provide good tutorials and examples. In addition, this page lists several industrial use cases, with slides and code, presented at KDD 2021. This page shows a detailed example of running meta-learners using CausalML.

  • Some resources in Chinese:
    • This blog for an introduction to causal inference, especially propensity score matching.
    • This series of notes for a glimpse of The Book of Why and more illustrations of the theory.
    • This talk for a general introduction to causal inference.
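
For readers wondering what the meta-learners mentioned in the list above actually do, here is a minimal sketch of one of them, the T-learner, written in plain Python. It is not CausalML’s API: the data generator, the one-covariate linear models, and all function names are assumptions made for illustration. The idea is to fit one outcome model on treated units and another on untreated units, then take the difference of their predictions as the estimated effect for a given covariate value.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def t_learner(xs, treated, ys):
    """Fit separate outcome models per arm; return a function estimating the effect at x."""
    x1 = [x for x, a in zip(xs, treated) if a]
    y1 = [y for y, a in zip(ys, treated) if a]
    x0 = [x for x, a in zip(xs, treated) if not a]
    y0 = [y for y, a in zip(ys, treated) if not a]
    b1, m1 = fit_line(x1, y1)
    b0, m0 = fit_line(x0, y0)

    def cate(x):
        return (b1 + m1 * x) - (b0 + m0 * x)

    return cate


def make_data(n=5000, seed=0):
    """Hypothetical randomized data where the true effect at x is 1 + 2x."""
    rng = random.Random(seed)
    xs, treated, ys = [], [], []
    for _ in range(n):
        x = rng.random()
        a = rng.random() < 0.5
        y = 3.0 * x + (1.0 + 2.0 * x) * a + rng.gauss(0, 0.5)
        xs.append(x)
        treated.append(a)
        ys.append(y)
    return xs, treated, ys


cate = t_learner(*make_data())
print(f"estimated effect at x=0: {cate(0.0):.2f}")
print(f"estimated effect at x=1: {cate(1.0):.2f}")
```

Treatment is assigned at random in this simulation, so the two fitted models are directly comparable; with observational data the same recipe needs the confounders included as covariates, and the packages above offer more flexible base models than a straight line.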