This repository contains the implementation of the approach discussed in *Event causality identification with synthetic control*, presented at EMNLP 2024.
- Create a `.env` file in the root of the repository with the `OPENAI_API_KEY` environment variable:

      OPENAI_API_KEY="<your openai key>"

- Optionally, add Langfuse API keys to `.env` to enable tracing for OpenAI calls:

      LANGFUSE_SECRET_KEY="<langfuse secret key>"
      LANGFUSE_PUBLIC_KEY="<langfuse public key>"
      LANGFUSE_HOST="<langfuse host>"
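To illustrate the `KEY="value"` format expected in `.env`, here is a minimal loader sketch. The project itself presumably reads the file via a library such as python-dotenv (an assumption); this is only an illustration, not the repository's actual loading code.

```python
import os

# Minimal sketch of reading a .env file (illustration only; the project
# likely uses a dedicated library such as python-dotenv).
def load_env(path=".env"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip surrounding quotes from the value before exporting it.
            os.environ[key.strip()] = value.strip().strip('"')
```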
- Download the COPES dataset to `data/COPES.json`:

      curl -o data/COPES.json -LJ https://github.com/HKUST-KnowComp/COLA/raw/refs/heads/master/COPES_data/COPES.json

- Download the TinyStories dataset to `data/TinyStoriesV2-GPT4-train.txt`:

      curl -o data/TinyStoriesV2-GPT4-train.txt -L https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
- `conda` needs to be installed.
- Create the virtual environment using `conda env create -f environment.yml`
- Convert `TinyStoriesV2-GPT4-train.txt` to Parquet by running `python main.py setup-tiny-stories-parquet`
- Create the BM25 index by running `python main.py setup-tiny-stories-corpus`
Note that all indices are 0-indexed.
| Strategy | Description |
|---|---|
| `gpt4` | (Baseline) GPT-4 zero-shot inference |
| `sc` | (Synthetic Control) GPT-3.5 synthetic control |
| `sc4` | (Synthetic Control) GPT-4 synthetic control |
Run outputs are logged in `output/<strategy>/`.

`<test_case_id>` can be any ID from COPES.
`python main.py run-testcase-event <test_case_id> <event_id> <strategy>`

e.g. `python main.py run-testcase-event 0 0 sc`
`python main.py run-one <test_case_id> <strategy>`
`python main.py run_from_list <path_to_json> <strategy>`

`<path_to_json>` must be a file containing a single JSON array of indexes (e.g. `[1,2,3,4]`).
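Such an index file can be produced with a few lines of Python (the filename `testcases.json` is arbitrary; pass whatever path you use as `<path_to_json>`):

```python
import json

# Write a JSON array of COPES test-case indexes (0-indexed) to a file.
indexes = [1, 2, 3, 4]
with open("testcases.json", "w") as f:
    json.dump(indexes, f)
# The file now contains: [1, 2, 3, 4]
```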
`python main.py print-testcases <path_to_json>`
- Deadlocks have been observed to occasionally occur within DuckDB (or the Python DuckDB driver), causing corpus retrieval to fail.
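Since these deadlocks are intermittent, one workaround is to retry the failing call. A minimal sketch of a retry wrapper (the `query_corpus` function in the usage comment is hypothetical, not part of this repository):

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on exceptions to work around transient
    failures such as occasional DuckDB deadlocks. Re-raises the
    last error if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)  # back off briefly before retrying

# Hypothetical usage: retry a corpus lookup that may deadlock.
# result = with_retries(lambda: query_corpus("some event text"))
```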