Add ThermoML Archive dataset #118
Without speaking to the content, I worry about the maintainability of this module - please see the inline comments for specifics, mostly around repeated code snippets that may already be implemented by the standard library.
"description": "ThermoML is an XML-based IUPAC standard for the storage and exchange of experimental thermophysical and thermochemical property data. The ThermoML archive is a subset of Thermodynamics Research Center (TRC) data holdings corresponding to cooperation between NIST TRC and five journals.", # noqa | ||
"identifiers": [ | ||
{ | ||
"id": "", |
is it ok to not provide an id value?
No, this is unfinished (as mentioned in the PR description); this is the main bit that needs feedback/further discussion.
Point taken, but I think you are misunderstanding the point of these transform scripts, which are meant to be standalone and aren't really targeting reusability/maintainability. If the stdlib `file_digest` can do chunked hashing then fine by me; it's not clear from the docs whether the whole file needs to be in memory (the final CSV is quite big, so probably not desirable).
In fact, `hashlib.file_digest` was only added in Python 3.11, so we can't use it here (this project is 3.8 only). Happy to refactor into a two-line function though.
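For reference, a chunked hashing helper that works on 3.8 might look something like the sketch below (the function name, algorithm, and chunk size are placeholders, not anything already in this PR):

```python
import hashlib


def file_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so the full CSV never sits in memory."""
    digest = hashlib.md5()  # md5 chosen only as an example algorithm
    with open(path, "rb") as handle:
        # iter() with a b"" sentinel keeps reading 1 MiB chunks until EOF
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```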
I agree. If there is some code we can share across the scripts, great. But we don't envision those scripts being used for anything other than collecting the initial dataset.
I think this dataset is one example for which we might want to use our more "advanced" yaml templates, which specify prompts such as "what is the {speed of sound} for a mixture of {component 1} and {component 2}". The documentation on how to specify this is currently not so great. I can help after the weekend with an example if you would like me to do so.
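Purely to illustrate the idea (this is not the project's actual yaml schema, and the column names are invented), such a template would presumably be filled per parsed row, roughly like this:

```python
# Toy illustration only; the real templates live in the project's yaml files.
row = {"component_1": "water", "component_2": "ethanol", "speed_of_sound": "1450 m/s"}

template = "What is the speed of sound for a mixture of {component_1} and {component_2}?"

prompt = template.format(**row)
completion = row["speed_of_sound"]
print(prompt, "->", completion)
```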
Sounds good to me -- I'm not in any rush with this, so it could wait until the dataset hackathon perhaps (unless you are already heavily committed on other stuff for that!). I should be able to attend for a couple of hours in the morning at least. I think there is a lot of data being left on the table with the current parsing approach -- happy to investigate customising the thermopyl parser usage further and getting access to more of the [listed properties](https://trc.nist.gov/ThermoML/) (see below). It may even be worth splitting into multiple datasets.
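As a rough sketch of the kind of per-file customisation I mean (assuming the fork keeps upstream thermopyl's `Parser(filename).parse()` interface, which returns a list of measurement dicts per file; the paths and the filtering step are illustrative only):

```python
import glob

import pandas as pd
from thermopyl import Parser  # assumes the fork keeps the upstream Parser API

frames = []
for xml_file in glob.glob("ThermoML/**/*.xml", recursive=True):
    records = Parser(xml_file).parse()  # list of measurement dicts for one file
    df = pd.DataFrame(records)
    # Extra per-file processing (unit handling, filtering to the properties
    # listed at https://trc.nist.gov/ThermoML/, etc.) would go here, rather
    # than cleaning the final concatenated CSV.
    frames.append(df)

full = pd.concat(frames, ignore_index=True)
full.to_csv("thermoml.csv", index=False)
```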
@ml-evs, we have revised the prompt template syntax in the contribution guide - do you want to give it a shot? Otherwise, if you're overcommitted, I can also look at it. Just let me know.
I am definitely over-committed right now but don't just want to shift the burden to you... if you find time to look at it yourself then just write here, otherwise I will add it around the bottom of my to-do list again!
This PR adds the ThermoML dataset as a flat CSV file. I'm not sure if this is the most appropriate way of doing it yet, and it's not entirely clear to me how to define the targets for this dataset!
The compressed archive is about 200 MB, around 4 GB of XML and JSON files when extracted. The data takes about an hour to parse on my machine (~10k files into around ~2.6m rows).
Here, we use @marocsfelt's `thermopyl` fork to extract the XML files into a dataframe. I imagine this will need some additional processing to be useful in this project, but it's not clear to me yet what that would involve (and it probably needs to be done at the transform step, per file, rather than by cleaning the final CSV).

Other matters arising:
Example rows for current state:
Closes #117