This repository contains the dataset assembled by the International Union of Crystallography (IUCr) in the late 1990s to research the use of powder X-ray diffraction (PXRD) data for quantitative phase analysis.
Table of Contents

- 1.1. Dataset Contents
- 1.2. Data Format
- 1.3. License
- 1.4. Known Issues
- 2.2. Updating the Dataset
- 3.1. Managing Data
- 3.2. Releasing an Official Dataset Version
- 3.3. Dataset Conventions
This repository contains the dataset assembled by the International Union of Crystallography (IUCr) Commission on Powder Diffraction (CPD) for the Round Robin on Quantitative Phase Analysis. The original dataset and details about the CPD project are available directly from the project website (which is no longer actively maintained):
https://www.iucr.org/resources/commissions/powder-diffraction/projects/qarr
## 1.1. Dataset Contents

```
├── README.md           <- this file
├── RELEASE-NOTES.md    <- dataset release notes
├── DATASET-LICENSE     <- license for data components of the dataset
├── DATASET-NOTICE      <- copyright notices for third-party data included in
│                          the dataset
├── SOFTWARE-LICENSE    <- license for software components of the dataset
├── SOFTWARE-NOTICE     <- copyright notice for the software components of the
│                          dataset
├── Makefile            <- Makefile containing useful shortcuts (`make` rules).
│                          Use `make help` to show the list of available rules.
├── pyproject.toml      <- Python project metadata file
├── poetry.lock         <- Poetry lockfile
├── bin/                <- scripts and programs for managing the dataset
├── data/               <- directory containing data for dataset
│   ├── mixtures/       <- PXRD data for mixtures
│   ├── pure-compounds/ <- PXRD data for pure compounds
│   └── VERSION         <- version of the latest official release of the dataset
├── docs/               <- dataset documentation
└── extras/             <- additional files and references that may be useful
                           for dataset maintenance
```
## 1.2. Data Format

- Data files with the `prn` extension contain diffractogram data stored in two-column format (without headers): 2-theta, counts. (See the example after this list for one way to inspect these files.)
- The `pure-compounds/structures.txt` file contains structure information for pure compounds (except for the pharmaceutical compounds `pharm1gr` and `pharm2gr`).
- The `mixtures/compositions.txt` file contains weight percentages for the samples associated with the diffractograms in the `mixtures` directory.
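As a quick sanity check on the two-column layout, the columns can be inspected from the command line. The file name below is hypothetical; substitute any `.prn` file from `data/mixtures/` or `data/pure-compounds/`.

```shell
# Label the two columns (2-theta, counts) for the first few rows of a diffractogram.
$ awk '{ printf "2-theta = %s  counts = %s\n", $1, $2 }' data/pure-compounds/EXAMPLE.prn | head -n 5
```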
## 1.3. License

The data contained in this dataset is covered under the Creative Commons Attribution 4.0 International Public License (included in the `DATASET-LICENSE` file). Licenses for third-party data included with this dataset are contained in the `DATASET-NOTICE` file.

The software contained in this repository is covered under the Apache License 2.0 (included in the `SOFTWARE-LICENSE` file). The copyright notice for the software is contained in the `SOFTWARE-NOTICE` file.
## 1.4. Known Issues

- Structure information is unavailable for the pharmaceutical compounds (i.e., PXRD data files `pharm1gr.prn` and `pharm2gr.prn`).
## 2.1. Importing the Dataset

The instructions provided below require DVC and FastDS to be installed. To use the dataset in a "read-only" manner (i.e., without maintenance code), import the dataset after initializing DVC in the working directory.
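Both tools are distributed as Python packages, so one way to install them is with `pip` (this assumes the standard PyPI package names for DVC and FastDS; adapt to your preferred package manager or environment):

```shell
# Install DVC and FastDS (which provides the `fds` command) into the current environment.
$ pip install dvc fastds
```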
1. Initialize DVC.

   ```shell
   $ cd /PATH/TO/PROJECT
   $ fds init
   $ fds commit "Initialize DVC"
   ```

   In the example commands above, `/PATH/TO/PROJECT` should be replaced by the path to the directory where the dataset will be used.

2. Recommended. Enable auto staging for DVC-managed data.

   ```shell
   $ dvc config core.autostage true
   $ fds commit "Enable DVC auto staging"
   ```

3. Import the dataset.

   ```shell
   $ dvc import https://github.com/velexi-research/IUCr-CPD-Quantitative-Phase-Analysis-Dataset.git data -o LOCAL_PATH
   $ fds commit "Import 'IUCr CPD Quantitative Phase Analysis Dataset'"
   ```

   In the example commands above, the following substitutions should be made:

   - `LOCAL_PATH` should be replaced by the local path, relative to `/PATH/TO/PROJECT`, where the dataset should be placed. Note: the parent directory of `LOCAL_PATH` should be created before running `dvc import`.
   - `DATASET_NAME` should be replaced by the name of the imported dataset (used in the `fds commit` message).

   For example, if the dataset repository is located at `https://github.com/account/cool-dataset` and we would like to place the dataset into a directory named `data/cool-dataset`, we would use the following commands:

   ```shell
   $ mkdir data
   $ dvc import https://github.com/account/cool-dataset data -o data/cool-dataset
   $ fds commit "Import cool-dataset"
   ```
## 2.2. Updating the Dataset

If a previously imported dataset has been updated, the local copy of the dataset can be updated (to the latest version on the default branch of the dataset Git repository) by using the `dvc update` command.

```shell
$ dvc update DATASET.dvc
$ fds commit "Update 'IUCr CPD Quantitative Phase Analysis Dataset'"
```

or

```shell
$ dvc update DATASET
$ fds commit "Update 'IUCr CPD Quantitative Phase Analysis Dataset'"
```

In the example commands above, the following substitutions should be made:

- `DATASET.dvc` should be replaced by the `.dvc` file that was generated when the dataset was imported (or, equivalently, `DATASET` should be replaced by the name of the directory that the dataset was imported into).

To specify the particular revision of the dataset to retrieve, use the `--rev REVISION` option, where `REVISION` is a Git tag, branch, or commit SHA/hash.

```shell
$ dvc update DATASET.dvc --rev REVISION
$ fds commit "Update 'IUCr CPD Quantitative Phase Analysis Dataset'"
```
## 3.1. Managing Data

### 3.1.1. Adding Data

1. Add the data files to the `data` directory.

2. Add the contents of `data` to the data tracked by DVC.

   ```shell
   $ fds add data
   ```

3. Commit the dataset changes to the local Git repository.

   ```shell
   $ fds commit "Add initial version of data"
   ```

4. Push the dataset changes to the remote Git repository and DVC remote storage.

   ```shell
   $ fds push
   ```

### 3.1.2. Updating Data

1. Update the data files in the `data` directory.

2. Update the data tracked by DVC with the new content of the `data` directory.

   ```shell
   $ fds add data
   ```

3. Commit the dataset changes to the local Git repository.

   ```shell
   $ fds commit "Update dataset"
   ```

4. Push the dataset changes to the remote Git repository and DVC remote storage.

   ```shell
   $ fds push
   ```

### 3.1.3. Removing Data

1. Remove the data files from the `data` directory.

2. Update the data tracked by DVC with the new content of the `data` directory.

   ```shell
   $ fds add data
   ```

3. Commit the dataset changes to the local Git repository.

   ```shell
   $ fds commit "Update dataset"
   ```

4. Push the dataset changes to the remote Git repository and DVC remote storage.

   ```shell
   $ fds push
   ```
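Before committing in any of the workflows above, it can be useful to review what Git and DVC are about to record. FastDS provides a combined status view for this (assuming a reasonably recent FastDS release):

```shell
# Show the combined Git and DVC status of the working directory.
$ fds status
```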
## 3.2. Releasing an Official Dataset Version

1. Make sure that the dataset has been updated (Section 3.1.2).

2. Update the `README.md` file.

3. Increment the version number in `pyproject.toml` (see the example after this list for one way to do this with Poetry).

4. Update `data/VERSION`.

   ```shell
   $ cd data
   $ poetry version -s > VERSION
   ```

5. Recommended. Update the release notes for the dataset to include any major changes from the previous released version of the dataset.

6. Create a tag for the release in Git.

   ```shell
   $ git tag `poetry version -s`
   $ git push --tags
   ```

7. Optional. If the Git repository for the dataset is hosted on GitHub (or an analogous service), create a release associated with the Git tag created in the previous step.
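As one way to perform the version bump in step 3, Poetry can apply a standard bump rule directly to `pyproject.toml` (the `minor` rule below is only an illustration; any valid bump rule or explicit version string works):

```shell
# Bump the minor version in pyproject.toml, then print the new version number.
$ poetry version minor
$ poetry version -s
```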
## 3.3. Dataset Conventions

- `data` directory. All data files that should be imported when using the `dvc import URL data -o LOCAL_PATH` command should be placed in the `data` directory.
  - Depending on the nature of the dataset, it may be useful to organize the data files into sub-directories (e.g., by type of data).

- `data/VERSION` file. The `data/VERSION` file contains the version number of the latest official release of the dataset. It is generated automatically and should not be manually edited.

- `README.md` file. The `README.md` file should contain
  - a high-level description of the dataset and
  - instructions for software tools used to create and maintain the dataset.

- `docs` directory. The `docs` directory should be used for detailed documentation for the dataset (i.e., data and supporting software tools).

- `bin` directory. The `bin` directory should be used for supporting software tools (e.g., data capture and processing scripts) developed to help maintain the dataset.

- `pyproject.toml` file. Python dependencies for supporting tools should be maintained in the `pyproject.toml` file. Most of the time, the `poetry` utility will appropriately update `pyproject.toml` as dependencies are added or removed (see the example after this list).

- `extras` directory. The `extras` directory should be used for ancillary files (e.g., `direnv` configuration template, general reference documents for tools that are not dataset-specific).
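For instance, a dependency needed by a data-processing script under `bin/` could be added or removed with Poetry as follows (the `numpy` package is purely illustrative):

```shell
# Add a dependency; Poetry updates pyproject.toml and poetry.lock.
$ poetry add numpy

# Remove a dependency that is no longer needed.
$ poetry remove numpy
```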