The ACORN repository provides a simple way to assess and visualise the differences between algorithms that generate causal orders from data.
- About ACORN
- Usage
- Getting Started
  - 2.1 Python Version
  - 2.2 Installation
    - 2.2.1 Clone the Repository
    - 2.2.2 Complete Setup of Causal River Datasets
    - 2.2.3 Setup of Your Own Datasets
- Repository Structure
  - 4.1 algorithms/
  - 4.2 data/
  - 4.3 external/
  - 4.4 results/
  - 4.5 utils/
- Notes
To test that you have a working installation of the repository, try running the following. It may take quite some time to run.
python run.py
Or follow the step-by-step process below.
python generate_list_of_algorithms.py
# This file finds the algorithms within the algorithms/ directory and makes a list.
# The output is written to the algorithms/ directory in algorithm_list.txt.
python run_all_algorithms_on_dataset.py
# Run all algorithms in the algorithm_list.txt on the selected dataset
python print_results.py
# Run the main method of print_results.py
For more advanced use cases, or to make your own implementations, you can find more detail in the relevant Utilities section.
For setting up your own datasets, see the relevant Installation section.
The repository was tested using the following version of Python.
Python 3.12.4
You can check your own Python version by running this command in your terminal:
python --version
To get started, clone the repository and install the necessary Python packages using the requirements.txt file. (It is recommended to install packages and run all scripts within a virtual environment to avoid dependency conflicts.)
git clone https://github.com/thomas-dsl-johnson/ACORN.git
cd ACORN
pip install -r requirements.txt
We now have the following file structure. To see a full explanation of the file structure, go to 📂 Repository Structure.
.
├── algorithms/ ...
├── data/ ...
├── external/ ...
├── README.md
├── requirements.txt
├── run.py
└── utils/ ...
If you do not require the Causal_River Bavaria and East Germany datasets, you can skip this step. Let's look at the data/ directory:
data
├── ground_truth_available
│ ├── Causal_River
│ │ ├── Bavaria/ ...
│ │ ├── East Germany/ ...
│ │ └── Flood/ ...
│ └── IT_monitoring
│ ├── Antivirus_Activity/ ...
│ ├── Middleware_oriented_message_Activity/ ...
│ ├── Storm_Ingestion_Activity/ ...
│ └── Web_Activity/ ...
└── ground_truth_not_available
└── S&P500
├── sp500/ ...
└── sp500_5_columns/ ...
The Causal_River/ directory contains the Flood/ dataset directory but is missing the Bavaria/ and East Germany/ directories that appear in the Repository Structure section. This is because the CausalRiverBavaria and CausalRiverEastGermany datasets are too large for this repository; we need to download them from the original CausalRivers GitHub repository.
# 1. Clone the following submodules: 1. Causal River, 2. Causal Graph Recovery from Causal Order
git submodule update --init --recursive
cd external/causal_rivers
# 2. Follow the install steps for Causal River
./install.sh
conda activate causalrivers
python 0_generate_datsets.py
# 3. Create and populate the Bavaria and East Germany directories inside data/ground_truth_available/Causal_River/
cd ../..
mkdir data/ground_truth_available/Causal_River/Bavaria
mkdir data/ground_truth_available/Causal_River/East\ Germany
cp external/causal_rivers/product/rivers_bavaria.p data/ground_truth_available/Causal_River/Bavaria
cp external/causal_rivers/product/rivers_east_germany.p data/ground_truth_available/Causal_River/East\ Germany
cp external/causal_rivers/product/rivers_ts_bavaria.csv data/ground_truth_available/Causal_River/Bavaria
cp external/causal_rivers/product/rivers_ts_east_germany.csv data/ground_truth_available/Causal_River/East\ Germany
We now have:
Causal_River
├── Bavaria
│ ├── rivers_bavaria.p
│ └── rivers_ts_bavaria.csv
├── East Germany
│ ├── rivers_east_germany.p
│ └── rivers_ts_east_germany.csv
└── Flood/ ...
We must now preprocess the .csv files using the Causal Graph Recovery from Causal Order Repository.
cd external/recover_causal_graph_from_causal_order/generate_ground_truth
Using your favourite editor, change the following constants in process_causalriver.py:
ROOT_DIR = os.getcwd()
DATA_PATH = os.path.join(ROOT_DIR, "data/ground_truth_available/Causal_River", "East Germany")
input_filename = DATA_PATH + "/rivers_ts_east_germany.csv"
output_filename = DATA_PATH + "/rivers_ts_east_germany_preprocessed.csv"
Run the file.
python process_causalriver.py
Now repeat for Bavaria.
ROOT_DIR = os.getcwd()
DATA_PATH = os.path.join(ROOT_DIR, "data/ground_truth_available/Causal_River", "Bavaria")
input_filename = DATA_PATH + "/rivers_ts_bavaria.csv"
output_filename = DATA_PATH + "/rivers_ts_bavaria_preprocessed.csv"
Run the file.
python process_causalriver.py
We now have:
Causal_River
├── Bavaria
│ ├── rivers_bavaria.p
│ ├── rivers_ts_bavaria.csv
│ └── rivers_ts_bavaria_preprocessed.csv
├── East Germany
│ ├── rivers_east_germany.p
│ ├── rivers_ts_east_germany.csv
│ └── rivers_ts_east_germany_preprocessed.csv
└── Flood/ ...
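Each of these setup steps edits the same two constants once per region. The paths involved can be derived programmatically, as in the sketch below; the helper function is purely illustrative and not part of the repository.

```python
import os

ROOT_DIR = os.getcwd()

def causal_river_paths(region, stem):
    """Build the raw and preprocessed CSV paths for a CausalRivers region.

    Mirrors the constants edited in process_causalriver.py above; the
    helper itself is an illustration, not repository code.
    """
    data_path = os.path.join(
        ROOT_DIR, "data/ground_truth_available/Causal_River", region)
    return (os.path.join(data_path, stem + ".csv"),
            os.path.join(data_path, stem + "_preprocessed.csv"))

east_in, east_out = causal_river_paths("East Germany", "rivers_ts_east_germany")
bavaria_in, bavaria_out = causal_river_paths("Bavaria", "rivers_ts_bavaria")
```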
Now we will create the summary_matrix.npy files for Bavaria and East Germany. We will edit generate_causalriver_summary_matrix.py
ROOT_DIR = os.getcwd()
DATA_PATH = os.path.join(ROOT_DIR, "data/ground_truth_available/Causal_River", "East Germany")
input_data_filename = DATA_PATH + "/rivers_ts_east_germany_preprocessed.csv" # Make sure this path is correct
input_label_filename = DATA_PATH + "/rivers_east_germany.p" # Ground truth graph data
output_matrix_filename = DATA_PATH + '/summary_matrix.npy'
Then run the file.
python generate_causalriver_summary_matrix.py
Now, repeat for Bavaria.
ROOT_DIR = os.getcwd()
DATA_PATH = os.path.join(ROOT_DIR, "data/ground_truth_available/Causal_River", "Bavaria")
input_data_filename = DATA_PATH + "/rivers_ts_bavaria_preprocessed.csv" # Make sure this path is correct
input_label_filename = DATA_PATH + "/rivers_bavaria.p" # Ground truth graph data
output_matrix_filename = DATA_PATH + '/summary_matrix.npy'
Then run the file.
python generate_causalriver_summary_matrix.py
We now have:
Causal_River
├── Bavaria
│ ├── rivers_bavaria.p
│ ├── rivers_ts_bavaria.csv
│ ├── rivers_ts_bavaria_preprocessed.csv
│ └── summary_matrix.npy
├── East Germany
│ ├── rivers_east_germany.p
│ ├── rivers_ts_east_germany.csv
│ ├── rivers_ts_east_germany_preprocessed.csv
│ └── summary_matrix.npy
└── Flood/ ...
For completeness, we now need to generate the causal_order.txt files. We will edit generate_order_from_matrix.py:
ROOT_DIR = os.getcwd()
###
### Unchanged Code
###
DATA_PATH = os.path.join(ROOT_DIR, "data/ground_truth_available/Causal_River", "East Germany")
Then run the file.
python generate_order_from_matrix.py
Then change for Bavaria.
ROOT_DIR = os.getcwd()
###
### Unchanged Code
###
DATA_PATH = os.path.join(ROOT_DIR, "data/ground_truth_available/Causal_River", "Bavaria")
Then run the file.
python generate_order_from_matrix.py
We are done. The Causal_River file structure should now match the Repository Structure section.
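Conceptually, deriving a causal order from a summary matrix is a topological sort. A minimal sketch, assuming the convention that entry [i, j] = 1 means variable j is a cause of variable i (check generate_order_from_matrix.py for the convention it actually uses):

```python
import numpy as np

def order_from_summary_matrix(M):
    """Topologically sort variables so causes precede effects.

    Assumes M[i, j] == 1 means variable j causes variable i; this
    convention is an assumption, not taken from the repository.
    """
    M = np.asarray(M)
    remaining = set(range(M.shape[0]))
    order = []
    while remaining:
        # Variables with no remaining causes may come next.
        frontier = [i for i in remaining
                    if not any(M[i, j] for j in remaining if j != i)]
        if not frontier:
            raise ValueError("cycle detected: no valid causal order")
        for i in sorted(frontier):
            order.append(i)
        remaining -= set(frontier)
    return order

# The 7-variable example matrix shown later in this README:
M = np.zeros((7, 7), dtype=int)
M[4, [0, 1, 2, 3, 5]] = 1
M[6, [0, 1, 2, 3, 5]] = 1
order = order_from_summary_matrix(M)  # [0, 1, 2, 3, 5, 4, 6]
```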
If you have your own dataset then place it in data/ and read the instructions below.
Ensure the following data is correctly formatted and placed appropriately within the data/ground_truth_available directory:
- Causal Order: The causal order of variables should be stored in a causal_order.txt file as a Python list.
- Ground Truth Summary Matrix: The ground truth summary matrix must be in a summary_matrix.npy file, saved as a NumPy array.
- Dataset: Your dataset should be in a .csv file.
See below for how the file structure, causal_order.txt, and summary_matrix.npy should appear.
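As a concrete sketch of the two ground-truth file formats, the following writes and re-reads both files; the temporary directory stands in for your own folder under data/ground_truth_available/.

```python
import ast
import os
import tempfile

import numpy as np

dataset_dir = tempfile.mkdtemp()  # stand-in for data/ground_truth_available/Dataset1/

# causal_order.txt holds the order as a Python list literal.
causal_order = [0, 1, 2, 3, 5, 4, 6]
with open(os.path.join(dataset_dir, "causal_order.txt"), "w") as f:
    f.write(str(causal_order))

# summary_matrix.npy holds the ground-truth matrix as a NumPy array.
summary_matrix = np.zeros((7, 7), dtype=int)
summary_matrix[4, [0, 1, 2, 3, 5]] = 1
summary_matrix[6, [0, 1, 2, 3, 5]] = 1
np.save(os.path.join(dataset_dir, "summary_matrix.npy"), summary_matrix)

# Round-trip both files to confirm the formats.
with open(os.path.join(dataset_dir, "causal_order.txt")) as f:
    loaded_order = ast.literal_eval(f.read())
loaded_matrix = np.load(os.path.join(dataset_dir, "summary_matrix.npy"))
```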
Ensure the following data is correctly formatted and placed appropriately within the data/ground_truth_not_available directory:
- Dataset: Your dataset should be in a .csv file.
Then run utils/generate_data_when_ground_truth_not_available.py to generate the causal_order.txt and summary_matrix.npy files. Here, we use DirectLiNGAM to create a 'synthetic' ground truth. Replace the argument with the location of your dataset.
cd utils
python generate_data_when_ground_truth_not_available.py data/ground_truth_not_available/Dataset/data.csv
See below for how the file structure, causal_order.txt, and summary_matrix.npy should appear.
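What a 'synthetic' ground truth amounts to in practice: DirectLiNGAM estimates a weighted adjacency matrix, which can be binarised into a 0/1 summary matrix. The helper below is an assumption about that step, not the script's actual code.

```python
import numpy as np

def summary_from_adjacency(adjacency, threshold=0.0):
    """Binarise an estimated adjacency matrix into a 0/1 summary matrix.

    Hypothetical helper: the real script may use a different threshold
    or edge convention.
    """
    return (np.abs(np.asarray(adjacency)) > threshold).astype(int)

# A toy 2-variable estimate: variable 0 drives variable 1 with weight 0.8.
estimated = [[0.0, 0.0], [0.8, 0.0]]
S = summary_from_adjacency(estimated)  # S.tolist() == [[0, 0], [1, 0]]
```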
File Structure Example:
data/
├── ground_truth_available
│ └── Dataset1/
│ ├── causal_order.txt
│ ├── dataset1.csv
│ └── summary_matrix.npy
│ ...
└── ground_truth_not_available
└── Dataset2/
├── causal_order.txt
├── dataset2.csv
├── model.pkl
└── summary_matrix.npy
...
causal_order.txt Example:
[0, 1, 2, 3, 5, 4, 6]
summary_matrix.npy Example:
[[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 1, 0]]
ACORN
├── README.md
├── algorithms/ ...
├── data/ ...
├── external/ ...
├── requirements.txt
├── results/ ...
├── run.py
└── utils/ ...
algorithms
├── algorithm_list.txt
├── causal_order
│ ├── causal_order_result.py
│ ├── generic_causal_order_algorithm.py
│ ├── new
│ │ ├── direct_lingam_causal_order_algorithm_adding_nodes_in_batches_of_two.py
│ │ ├── direct_lingam_causal_order_algorithm_adding_nodes_in_batches_of_x.py
│ │ ├── direct_lingam_causal_order_algorithm_lookup.py
│ │ ├── direct_lingam_causal_order_algorithm_no_updates.py
│ │ ├── direct_lingam_causal_order_algorithm_threshold.py
│ │ └── para_lingam_causal_order_algorithm.py
│ └── original
│ └── direct_lingam_causal_order_algorithm.py
├── end_to_end
│ ├── end_to_end_result.py
│ ├── generic_end_to_end_algorithm.py
│ ├── new
│ │ ├── generic_algorithm_with_causal_graph_recovery_from_causal_order_.py
│ │ └── with/ ...
│ └── original
│ └── direct_lingam_end_to_end_algorithm.py
└── generic_algorithm.py
generic_algorithm.py: An abstract base class that defines the common interface for all algorithms, including methods for running the algorithm and handling results.
causal_order/: Contains algorithms that only determine the causal order of variables.
- generic_causal_order_algorithm.py: A generic class for causal order algorithms.
- original/direct_lingam_causal_order_algorithm.py: The standard implementation of the DirectLiNGAM causal ordering phase.
- new/: Contains variations of the causal order algorithm, such as adding nodes in batches.
end_to_end/: Contains algorithms that perform full causal graph discovery.
- generic_end_to_end_algorithm.py: A generic class for end-to-end algorithms.
- original/direct_lingam_end_to_end_algorithm.py: An implementation that uses the lingam library to perform the full DirectLiNGAM analysis.
- new/generic_algorithm_with_causal_graph_recovery_from_causal_order_.py: Runs a two-step approach that first finds the causal order using a subclass of generic_causal_order_algorithm and then recovers the causal graph using methods from the Recover-Causal-Graph-from-Causal-Order repository.
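The base-class pattern described above can be sketched as follows. The class and method names here (GenericAlgorithm, run) are illustrative assumptions rather than the repository's actual signatures, and the variance-based ordering is a toy stand-in for a real causal-order method such as DirectLiNGAM.

```python
from abc import ABC, abstractmethod

import numpy as np

class GenericAlgorithm(ABC):
    """Illustrative stand-in for the abstract base class in generic_algorithm.py."""

    @abstractmethod
    def run(self, data):
        """Return a causal order as a list of column indices."""

class VarianceOrderAlgorithm(GenericAlgorithm):
    """Toy 'causal order' algorithm: highest-variance columns first.

    Purely illustrative -- not DirectLiNGAM or any ACORN algorithm.
    """

    def run(self, data):
        return [int(i) for i in np.argsort(data.var(axis=0))[::-1]]

# Demo on a tiny two-variable dataset; column 1 has the larger variance.
data = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
order = VarianceOrderAlgorithm().run(data)  # [1, 0]
```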
data
├── ground_truth_available
│ ├── Causal_River
│ │ ├── Bavaria
│ │ │ ├── causal_order.txt
│ │ │ ├── rivers_bavaria.p
│ │ │ ├── rivers_ts_bavaria.csv
│ │ │ ├── rivers_ts_bavaria_preprocessed.csv
│ │ │ └── summary_matrix.npy
│ │ ├── East Germany
│ │ │ ├── causal_order.txt
│ │ │ ├── rivers_east_germany.p
│ │ │ ├── rivers_ts_east_germany.csv
│ │ │ ├── rivers_ts_east_germany_preprocessed.csv
│ │ │ └── summary_matrix.npy
│ │ └── Flood
│ │ ├── causal_order.txt
│ │ ├── rivers_flood.p
│ │ ├── rivers_ts_flood.csv
│ │ ├── rivers_ts_flood_preprocessed.csv
│ │ ├── rivers_ts_flood_preprocessed_dates_removed.csv
│ │ └── summary_matrix.npy
│ └── IT_monitoring
│ ├── Antivirus_Activity
│ │ ├── causal_graph_label.png
│ │ ├── causal_order.txt
│ │ ├── preprocessed_1.csv
│ │ ├── preprocessed_2.csv
│ │ ├── structure.txt
│ │ └── summary_matrix.npy
│ ├── Middleware_oriented_message_Activity
│ │ ├── causal_graph_label.png
│ │ ├── causal_order.txt
│ │ ├── monitoring_metrics_1.csv
│ │ ├── monitoring_metrics_2.csv
│ │ ├── structure.txt
│ │ └── summary_matrix.npy
│ ├── Storm_Ingestion_Activity
│ │ ├── causal_graph_label.png
│ │ ├── causal_order.txt
│ │ ├── storm_data_normal.csv
│ │ ├── structure.txt
│ │ └── summary_matrix.npy
│ └── Web_Activity
│ ├── causal_graph_label.png
│ ├── causal_order.txt
│ ├── preprocessed_1.csv
│ ├── preprocessed_2.csv
│ ├── structure.txt
│ └── summary_matrix.npy
└── ground_truth_not_available
└── S&P500
├── sp500
│ ├── causal_order.txt
│ ├── model.pkl
│ ├── sp500.csv
│ └── summary_matrix.npy
└── sp500_5_columns
├── causal-graph
├── causal-graph.pdf
├── causal_order.txt
├── model.pkl
├── sp500_5_columns.xlsx
└── summary_matrix.npy
This directory contains the datasets. Each dataset has its own subfolder, which includes the raw data and the corresponding ground truth files. The repository includes:
ground_truth_available/
- IT Monitoring Data: Source: Case_Studies_of_Causal_Discovery
- CausalRiver Datasets: Source: CausalRivers. For the Bavaria and East Germany data you must complete installation step 2.2.2 (Complete Setup of Causal River Datasets).
ground_truth_not_available/
- S&P500 Data
external
├── causal_rivers/ ...
└── recover_causal_graph_from_causal_order/ ...
We have two submodules: Causal Rivers and the Causal Graph Recovery from Causal Order Repository. To clone the submodules, run the following snippet. This is done during installation step 2.2.2 (Complete Setup of Causal River Datasets).
git submodule update --init --recursive
results
├── causal_order/ ...
└── end_to_end/ ...
This folder stores the outputs of analysis.
utils
├── compare_results.py
├── generate_data_when_ground_truth_not_available.py
├── generate_list_of_algorithms.py
├── print_results.py
├── run_all_algorithms_on_dataset.py
└── storage.py
This directory contains utility scripts.
compare_results.py: A set of functions that compare two results.
generate_data_when_ground_truth_not_available.py: Loads data from a .csv or .xlsx file, runs DirectLiNGAM, and saves the results to the same directory as the input file.
generate_list_of_algorithms.py: Finds the algorithms within the algorithms/ directory and makes a list. The output is written to algorithm_list.txt in the algorithms/ directory.
print_results.py: Finds the results within the results/ directory and prints a summary of each one.
run_all_algorithms_on_dataset.py: Runs the ._get_and_save_result() method of all algorithms on the selected dataset.
storage.py: Handles the serialisation of data and deserialisation of results.
- Ensure that the number of variables in your dataset matches the dimensions of the summary matrix.
- For large datasets (more than 15 variables), such as CausalRiverFlood, visualising the full causal graph is not recommended, as it can become cluttered and difficult to interpret.
- Thank you to Zhao Tong (@jultrishyyy) for the Causal Graph Recovery from Causal Order Repository and its detailed README.
- For any issues or questions, please open an issue on the repository's issue tracker.
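The first note above, that the number of variables must match the summary matrix dimensions, can be checked with a short script. A sketch, assuming the CSV's first row is a header with one column per variable:

```python
import csv
import os
import tempfile

import numpy as np

def check_dataset(csv_path, matrix_path):
    """Sanity-check that a dataset and its summary matrix agree in size.

    Assumes the CSV's first row is a header naming one column per variable.
    """
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    matrix = np.load(matrix_path)
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError(f"summary matrix is not square: {matrix.shape}")
    if len(header) != matrix.shape[0]:
        raise ValueError(
            f"{len(header)} variables in the CSV but the summary matrix "
            f"is {matrix.shape[0]}x{matrix.shape[1]}")
    return len(header)

# Demo on a throwaway three-variable dataset.
d = tempfile.mkdtemp()
csv_path = os.path.join(d, "toy.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows([["x0", "x1", "x2"], [1, 2, 3]])
matrix_path = os.path.join(d, "summary_matrix.npy")
np.save(matrix_path, np.zeros((3, 3), dtype=int))
n_vars = check_dataset(csv_path, matrix_path)  # 3
```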