This repo contains the code and metadata for the paper "Data Driven Design as an Ecologically Valid Challenge Task for Document Understanding".
Install the dependencies for the repo as follows:

```
pip install -r requirements.txt
```

Because this dataset consists of non-open access academic papers, we are limited to providing the
metadata for this dataset, i.e. the DOIs of the papers, and publishing our further annotated
versions of the zeolite data originally published in Jensen et al. (2019) and the aluminum data
from Pfeiffer et al. (2020). Our re-annotated versions of this data can be found in the respective
folders in the data/ directory.
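As a quick-start sketch, the annotations can be loaded with pandas. The filenames below are illustrative only, not guaranteed; check the data/ subfolders for the actual annotation files:

```python
import pandas as pd

# Illustrative filenames only; the actual files in data/zeolite/ and
# data/aluminum/ may be named differently.
zeolite = pd.read_csv("data/zeolite/annotations.csv")
aluminum = pd.read_csv("data/aluminum/annotations.csv")

print(zeolite.columns.tolist())
print(aluminum.columns.tolist())
```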
We additionally provide our EDA notebooks in this repository, as eda_zeolite.ipynb
and eda_aluminum.ipynb, respectively.
To reconstruct the sections of the dataset that are available from publisher TDM APIs, we provide
the construct_dataset.py script. This script pulls papers in publisher-provided format from the
APIs of three publishers: Elsevier, Wiley, and Springer, assuming that the user is licensed
appropriately. To use this script, you must first get API keys for Elsevier
here and Wiley
here.
The user must then specify these API keys, along with a mailto address for the
polite usage
tier of the Crossref API for Springer papers, in secrets.json:

```json
{
"ELSEVIER_API_KEY": "...",
"WILEY_API_KEY": "...",
"SPRINGER_MAILTO": "..."
}
```
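For reference, the credentials can be read with the standard library, and the mailto address attached to Crossref requests to qualify for the polite pool. This is only a sketch of how a script might consume secrets.json, not the actual implementation in construct_dataset.py, and the DOI shown is hypothetical:

```python
import json
import requests

# Read the credentials from the secrets file described above.
with open("secrets.json") as f:
    secrets = json.load(f)

elsevier_key = secrets["ELSEVIER_API_KEY"]
wiley_key = secrets["WILEY_API_KEY"]
springer_mailto = secrets["SPRINGER_MAILTO"]

# Example Crossref lookup for a hypothetical DOI; passing a mailto parameter
# routes the request to Crossref's polite usage tier.
response = requests.get(
    "https://api.crossref.org/works/10.1000/example.doi",
    params={"mailto": springer_mailto},
)
metadata = response.json()["message"]
```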
Once the secrets file has been set up appropriately, you can run the construct dataset script:

```
python construct_dataset.py --dataset=<zeolite|aluminum> --from_scratch=<False|True>
```

This script will download XML from Elsevier, PDFs from Wiley, and both a PDF and an HTML/XML
representation from Springer, and place them in xml and pdf directories within the data
directory for each dataset. The from_scratch parameter determines whether the script will fetch
publisher metadata afresh (not recommended) or use the version already provided in
data/<dataset>/publisher_metadata.csv. It will not replace files that have already been downloaded.
Every file is stored in the format <doi>.<extension>, where filesystem-noncompliant characters in
the DOI, e.g. "/", have been replaced with underscores.
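For illustration, the DOI-to-filename mapping works roughly as follows. This is a hypothetical helper with a made-up DOI; the actual sanitization code lives in construct_dataset.py:

```python
def doi_to_filename(doi: str, extension: str) -> str:
    # Replace filesystem-noncompliant characters such as "/" with
    # underscores, then append the file extension.
    return doi.replace("/", "_") + "." + extension

# e.g. "10.1016/j.example.2020.12345" -> "10.1016_j.example.2020.12345.xml"
print(doi_to_filename("10.1016/j.example.2020.12345", "xml"))
```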
For models that do not accept PDFs and instead require images of individual pages, we provide the
generate_page_images.py script, which segments each PDF into pages and writes a PNG of each page
to an individual file, named in the format <doi>_<page_number>.<extension>.
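As a rough sketch of what this step does (generate_page_images.py may use a different rendering library or resolution), each PDF page can be rendered to a PNG with PyMuPDF:

```python
import fitz  # PyMuPDF

def pdf_to_page_pngs(pdf_path: str, output_stem: str, dpi: int = 150) -> None:
    # Render each page of the PDF to a separate PNG file named
    # <output_stem>_<page_number>.png.
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc):
        pixmap = page.get_pixmap(dpi=dpi)
        pixmap.save(f"{output_stem}_{page_number}.png")
    doc.close()
```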
TBD.
We provide our baseline predictions in the predictions/ folder.
Once predictions have been generated, run the evaluation script with the following command:
```
python evaluate_and_report.py --dataset=<zeolite|aluminum> --modality=<pdf|xml> --predictions_path=...
```

This will evaluate the provided predictions against the ground truth and generate a JSON report, containing both final results and per-location results, in a file with the same filename as the predictions, in the same directory. The script expects a predictions file in CSV format with a "doi" column and additional columns corresponding to the columns to predict in the dataset.
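For illustration, a minimal predictions file could be assembled like this. The DOIs, prediction column names, and output path below are placeholders; use the actual target columns of the chosen dataset:

```python
import pandas as pd

# "doi" is required; the remaining columns are placeholders standing in for
# the dataset's actual prediction targets.
predictions = pd.DataFrame(
    {
        "doi": ["10.1016/j.example.2020.12345", "10.1002/example.202000123"],
        "target_column_1": [1.23, 4.56],
        "target_column_2": ["value_a", "value_b"],
    }
)
predictions.to_csv("predictions/my_model_zeolite.csv", index=False)
```

The resulting file can then be passed to the evaluation script via --predictions_path.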
Baseline figures TBD.
To regenerate the location visualization (figure 2 in the paper), run the cells in the
location_visualization.ipynb notebook.
