Update causality intro #81

Merged 3 commits on Dec 21, 2024
41 changes: 9 additions & 32 deletions sections/0_causality/causal_intro/article/intro-causality.qmd
@@ -56,11 +56,13 @@
Some are historical, examining a *single* history, while others are contemporary.
Some fields already have a long tradition of causal inference, while others have paid it less attention.
We believe that science studies, whether scientometrics, science of science, science and technology studies, or sociology of science, has paid relatively little attention to questions of causality, with some notable exceptions [e.g., @aagaard_considerations_2017; @glaser_governing_2016].

- We here provide an introduction to causal inference for science studies.
+ We here provide an introduction to causal inference for science studies, with a particular focus on the effects of Open Science on impact.
Multiple introductions to structural causal modelling of varying complexity already exist [@rohrer2018; @arif2023; @elwert2013].
@dong_beyond_2022 introduce matching strategies to information science.
We believe it is beneficial to introduce causal thinking using familiar examples from science studies, making it easier for researchers in this area to learn about causal approaches.
We avoid technicalities, so that the core ideas can be understood even with little background in statistics.
We first introduce the general approach, which we then briefly illustrate in three short case studies.
+ In addition, we provide extensive descriptions of approaching causality in three specific case studies: in academic impact (on the effect of [Open Data on citations](../../open_data_citation_advantage.qmd)), in [societal impact](../../social_causality.qmd), and in economic impact (on the effect of [Open Data on Cost Savings](../../open_data_cost_savings.qmd)).

## The fundamental problem

@@ -82,12 +84,10 @@
For instance, non-compliance in experimental settings might present difficulties.
Additionally, scholars might be interested in identifying mediating factors when running experiments, which further complicates identifying causality [@rohrer2022].
In other words, causal inference presents a continuum of challenges, where experimental settings are typically easiest for identifying causal effects---but certainly no panacea---and observational settings are more challenging---but certainly not impossible.

- In this paper we introduce a particular view on causal inference, namely that of structural causal models [@pearl_causality_2009].
+ In this Open Science Impact Indicator Handbook we introduce a particular view on causal inference, namely that of structural causal models [@pearl_causality_2009].
This is a relatively straightforward approach to causal inference with a clear visual representation of causality.
It should allow researchers to reason about and discuss their causal thinking more easily.
- In the next section, we explain structural causal models in more detail.
- We then cover some case studies based on simulated data to illustrate how causal estimates can be obtained in practice.
- We close with a broader discussion on causality.
+ We explain structural causal models in more detail in the next section.
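
To give a flavour of what a structural causal model is, here is a minimal sketch (our illustration, not from the handbook; variable names and coefficients are made up) in which each variable is an assignment of its parents in the DAG plus noise, and an intervention replaces one assignment:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# A structural causal model is a set of assignments, one per variable,
# each a function of the variable's parents in the DAG plus noise.
field = rng.normal(size=n)                    # Field (exogenous)
open_data = 0.8 * field + rng.normal(size=n)  # Field -> Open data
citations = (0.3 * open_data + 1.2 * field    # Open data -> Citations,
             + rng.normal(size=n))            # Field -> Citations

# An intervention (Pearl's do-operator) replaces an assignment wholesale:
# do(Open data = 1) cuts the incoming arrow from Field.
open_data_do = np.ones(n)
citations_do = 0.3 * open_data_do + 1.2 * field + rng.normal(size=n)
```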

# Causal inference - a brief introduction {#sec-causal-inference}

@@ -614,6 +614,9 @@
The example highlights that relatively simple DAGs are often sufficient to uncover identification problems.
For instance, if we had not measured *Field*, controlling for it and identifying the causal effect would become impossible.
In that case, it is irrelevant whether there are any other confounding effects between *Citations* and *Open data*, since those effects do not alleviate the problem of being unable to control for *Field*.

+ The discussion here focuses specifically on illustrating the general principles.
+ In the case study on the effect of [Open Data on citations](../../open_data_citation_advantage.qmd), we examine this in greater detail and with a higher degree of realism.
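
To make the role of the confounder concrete, here is a small simulation sketch (ours, continuing the toy model above with made-up coefficients) comparing a naive estimate of the effect of *Open data* on *Citations* with one that controls for *Field*:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

field = rng.normal(size=n)                    # confounder
open_data = 0.8 * field + rng.normal(size=n)  # treatment, affected by Field
citations = 0.3 * open_data + 1.2 * field + rng.normal(size=n)  # true effect 0.3

# Naive regression of Citations on Open data alone: biased, because the
# backdoor path Open data <- Field -> Citations is left open.
naive = np.polyfit(open_data, citations, 1)[0]

# Including Field in the regression blocks the backdoor path.
X = np.column_stack([open_data, field, np.ones(n)])
adjusted, _, _, _ = np.linalg.lstsq(X, citations, rcond=None)
print(f"naive: {naive:.2f}, adjusted: {adjusted[0]:.2f}, true: 0.30")

# If Field were unmeasured, no regression on the remaining variables
# could remove this bias, which is exactly the point made above.
```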

## The effect of Open data on Reproducibility {#sec-open-data-on-repro}

Suppose we are interested in the causal effect of *Open data* on *Reproducibility*.
@@ -800,7 +803,7 @@
Taking measurement seriously can expose additional challenges that need to be addressed.

The study of science is a broad field with a variety of methods.
Academics have employed a range of perspectives to understand science's inner workings, driven by the field's diversity in researchers' disciplinary backgrounds [@sugimoto2011; @liu2023].
- In this paper we highlight why causal thinking is important for the study of science, in particular for quantitative approaches.
+ In this chapter we highlight why causal thinking is important for the study of science, in particular for quantitative approaches.
In doing so, we do not mean to suggest that we always need to estimate causal effects.
Descriptive research is valuable in itself, providing context for uncharted phenomena.
Likewise, studies that predict certain outcomes are very useful.
@@ -877,32 +880,6 @@
For example, when developing an interview guide to study a particular phenomenon, an explicit causal model can help inform which questions to ask.
Furthermore, even if qualitative data cannot easily quantify the precise strength of a causal relationship, it may corroborate the structure of a causal model.
Ultimately, combining quantitative causal identification strategies with direct qualitative insights on mechanisms can lead to more comprehensive evidence [@munafò2018; @tashakkori2021], strengthening and validating our collective understanding of science.

- # Acknowledgements {.unnumbered}
-
- We thank Ludo Waltman, Tony Ross-Hellauer, Jesper W. Schneider and Nicki Lisa Cole for valuable feedback on an earlier version of the manuscript.
- TK used GPT-4 and Claude v2.1 to assist in language editing during the final revision stage.
-
- # Author contributions {.unnumbered}
-
- Thomas Klebel: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing - original draft, and Writing - review & editing.
- Vincent Traag: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing - original draft, and Writing - review & editing.
-
- # Competing interests {.unnumbered}
-
- The authors have no competing interests.
-
- # Funding information {.unnumbered}
-
- The authors received funding from the European Union’s Horizon Europe framework programme under grant agreement Nos. 101058728 and 101094817.
- Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Executive Agency.
- Neither the European Union nor the European Research Executive Agency can be held responsible for them.
- The Know-Center is funded within COMET—Competence Centers for Excellent Technologies—under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria.
- COMET is managed by the Austrian Research Promotion Agency FFG.
-
- # Data and code availability {.unnumbered}
-
- All data and code, as well as a reproducible version of the manuscript, are available at [@klebel_code].

# Theoretical effect of Rigour on Reproducibility {#sec-appendix-rigour-on-reproducibility .appendix}

There is a direct effect of *Rigour* on *Reproducibility* and an indirect effect, mediated by *Open data*.
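
In a linear structural causal model this decomposition can be written out explicitly; a sketch in our own illustrative notation (R for *Rigour*, D for *Open data*, P for *Reproducibility*; the coefficients a, b, c are not from the handbook):

$$
\begin{aligned}
D &= a\,R + \varepsilon_D \\
P &= b\,D + c\,R + \varepsilon_P = (c + ab)\,R + b\,\varepsilon_D + \varepsilon_P
\end{aligned}
$$

The total effect of *Rigour* on *Reproducibility* is then $c + ab$: the direct part $c$ plus the indirect part $ab$, mediated by *Open data*.
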
6 changes: 2 additions & 4 deletions sections/0_causality/open_data_citation_advantage.qmd
@@ -16,17 +16,15 @@
affiliations:
# The effect of Open Data on Citations {#open-data-citation-advantage .unnumbered}

::: {.callout collapse="true"}

## History

| Version | Revision date | Revision | Author |
|---------|---------------|-------------|------------|
| 1.1 | 2024-11-27 | Revisions | V.A. Traag |
| 1.0 | 2024-11-13 | First draft | V.A. Traag |

:::

- We here provide some idea of what it would take to try to infer the causal effect of one specific Open Science practice on citation impact. In particular, we consider the effect of Open Data on citation impact. That is, papers that share their data might be more likely to be cited. This is something that has been called the "Open Data Citation Advantage", and in the PathOS scoping review of the academic impact of Open Science [@klebel_academic_2024], evidence was found for a small positive effect of sharing data.
+ We here provide some idea of what it would take to try to infer the causal effect of one specific Open Science practice on citation impact. In particular, we consider the effect of [Open Data](../1_open_science/prevalence_open_fair_data_practices.qmd) on [citation impact](../2_academic_impact/citation_impact.qmd). That is, papers that share their data might be more likely to be cited. This is something that has been called the "Open Data Citation Advantage", and in the PathOS scoping review of the academic impact of Open Science [@klebel_academic_2024], evidence was found for a small positive effect of sharing data.

Inferring the causal effect of open data on citation impact is not straightforward and cannot easily be done in an experimental setting. Although an experimental study could in principle be done, it would require researchers to participate and follow the experimental, randomised "treatment" of sharing data or not, which will be challenging, especially where more and more data policies mandate that data should be shared. This means that, barring such experiments, we have to rely on observational studies of citations to publications that have (not) shared their data. Note that we here only focus on whether data was shared or not, not whether the data is FAIR or not, or the extent to which it is FAIR, although that might be a relevant confounder to consider.

@@ -35,7 +33,7 @@
We will try to produce a relevant structural causal model by going through the following steps:
1. Consider causal factors that affect or are affected by X or Y.
2. Consider effects between the identified factors.

- Let us start by considering factors that have a causal effect on the number of citations to a paper. As suggested above, there are many factors that correlate with citations [@onodera2015]. The scientific field and the year of publications are two very clear causal factors. One other relevant aspect is obviously something like the quality or relevance of the research: higher quality or research that is more relevant to more researchers, will be more likely to be cited. Unfortunately, such a quality or relevance is not directly observable. Where something is published, i.e. which journal, is likely to have a causal effect on the citations [@traag2021]. In addition, there are most likely some reputational effects of the author and the institution [@way2019]. Finally, (international) collaboration might be likely to have some effect on citations as well, potentially mediated by network effects.
+ Let us start by considering factors that have a causal effect on the number of citations to a paper. As suggested above, there are many factors that correlate with citations [@onodera2015]. The scientific field and the year of publication are two very clear causal factors, and are usually also considered when [normalising citations](../2_academic_impact/citation_impact.qmd#avg.-total-normalised-citations-mncs-tncs). One other relevant aspect is something like the quality or relevance of the research: research of higher quality, or research that is relevant to more researchers, will be more likely to be cited. Unfortunately, quality or relevance is not directly observable. Where something is published, i.e. which journal, is likely to have a causal effect on the citations [@traag2021]. In addition, there are most likely some reputational effects of the author and the institution [@way2019]. Finally, (international) collaboration might have some effect on citations as well, potentially mediated by network effects.

Let us then consider factors that have a causal effect on the sharing of open data. One clearly relevant factor is the open data policy of the journal where the publication is published: if a journal has a clear open data policy that requires authors to make data available (e.g. [PLOS’ Data Policy](https://journals.plos.org/plosone/s/data-availability)), publications in that journal might be more likely to make their data available. Similarly, if research is funded by a funder that has a clear open data policy (e.g. [Wellcome Trust’s Data Policy](https://wellcome.org/grant-funding/guidance/policies-grant-conditions/data-software-materials-management-and-sharing-policy)), the data might be more likely to be made available. Funding might also make it more likely that authors make their data openly available due to an increase in resources (e.g. data support). Similarly, institutional resources (e.g. data support or training) might help make data open. Some fields may have an academic culture in which scholars are more accustomed to making their data openly available. In addition, some research approaches in a field might be more likely to make their data available than others (e.g. it might be easier to share anonymised quantitative data from surveys as opposed to thick interview data). Lastly, open data has increasingly become a standard, meaning that more recent publications might be more likely to share their data.
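
Pulling the factors from the last two paragraphs together, here is a sketch of the hypothesised DAG (our encoding; the node names and the use of networkx are illustrative, not part of the handbook):

```python
import networkx as nx

# Hypothesised DAG: each edge encodes one of the causal claims above.
dag = nx.DiGraph([
    ("Field", "Open data"), ("Field", "Citations"),
    ("Year", "Open data"), ("Year", "Citations"),
    ("Journal", "Open data"), ("Journal", "Citations"),   # data policy / venue
    ("Funder policy", "Open data"),
    ("Institutional resources", "Open data"),
    ("Quality", "Citations"),                              # unobserved
    ("Collaboration", "Citations"),
    ("Open data", "Citations"),                            # effect of interest
])

# Nodes with arrows into both Open data and Citations are potential
# confounders, and hence candidates for the adjustment set.
confounders = sorted(
    set(dag.predecessors("Open data")) & set(dag.predecessors("Citations"))
)
print(confounders)  # ['Field', 'Journal', 'Year']
```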

2 changes: 2 additions & 0 deletions sections/2_academic_impact/citation_impact.qmd
@@ -31,6 +31,8 @@
affiliations:

The citation impact of publications reflects the degree to which they have been taken up by other researchers in their publications. There are long-standing discussions about the interpretation of citations, where two theories can be discerned [@bellis2009]: a normative theory, proposing citations reflect acknowledgements of previous work [@merton1973]; and a constructivist theory, proposing citations are used as tools for argumentation [@latour1988]. Overall, citation impact seems to be most closely related to the relevance of the work for the academic community and should be distinguished from other considerations of scientific quality, where the relationship is less clear [@aksnes2019].

+ Although it is beyond our scope to discuss causal inference for every possible effect of Open Science on citations, we discuss one case: the effect of [Open Data on citations](../0_causality/open_data_citation_advantage.qmd).

## Metrics

Citations are affected by two major factors that we expect to be irrelevant for considerations of impact: the field of research, and the year of publication[^pub-year]. That is, some fields, such as Cell Biology, are much more citation intensive than other fields, such as Mathematics. Additionally, publications that were published in 2010 have had more time to accumulate citations than publications published in 2020. Controlling for these factors[^normalisation-factors] results in what are often called “normalised” citation indicators [@waltman2019]. Although such normalised citation indicators are more comparable across time and field, they are sometimes also more opaque. For that reason, we explain both normalised metrics and “raw”, non-normalised, citation metrics.
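
A minimal sketch of such a field- and year-normalised indicator (our toy implementation of the MNCS-style idea, with made-up data; not the handbook's reference code):

```python
import pandas as pd

# Toy publication-level data: citations with field and publication year.
pubs = pd.DataFrame({
    "field":     ["Cell Biology", "Cell Biology", "Mathematics", "Mathematics"],
    "year":      [2015, 2015, 2015, 2015],
    "citations": [80, 40, 6, 2],
})

# Normalise each publication by the mean citations of its field-year group,
# so a value of 1 means "cited as much as the average paper in its field and year".
expected = pubs.groupby(["field", "year"])["citations"].transform("mean")
pubs["normalised"] = pubs["citations"] / expected

print(pubs)
# The Mathematics paper with 6 citations scores 1.5, above its field average,
# even though its raw count is far below both Cell Biology papers.
```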