Add ThermoML Archive dataset #118
Without speaking to the content, I worry about the maintainability of this module - please see the inline comments for specifics, mostly around repeated code snippets that may already be implemented by the standard library.
"description": "ThermoML is an XML-based IUPAC standard for the storage and exchange of experimental thermophysical and thermochemical property data. The ThermoML archive is a subset of Thermodynamics Research Center (TRC) data holdings corresponding to cooperation between NIST TRC and five journals.", # noqa | ||
"identifiers": [ | ||
{ | ||
"id": "", |
is it ok to not provide an id value?
No, this is unfinished (as mentioned in the PR description); this is the main bit that needs feedback/further discussion.
Point taken, but I think you are misunderstanding the point of these transform scripts, which are meant to be standalone and aren't really targeting reusability/maintainability. If the stdlib `file_digest` can do chunked hashing then fine by me; it's not clear from the docs whether the whole file needs to be in memory (the final CSV is quite big, so probably not desirable).
In fact, `hashlib.file_digest` was only added in Python 3.11, so we can't use it here (this project is 3.8 only). Happy to refactor into a two-line function though.
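For reference, a chunked hashing helper that works on 3.8 might look something like the sketch below (the function name, algorithm, and chunk size are placeholders, not anything already in this PR):

```python
import hashlib


def file_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so the full CSV never sits in memory."""
    digest = hashlib.md5()  # md5 chosen only as an example algorithm
    with open(path, "rb") as handle:
        # iter() with a b"" sentinel keeps reading 1 MiB chunks until EOF
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```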
I agree. If there is some code we can share across the scripts, great. But we don't envision those scripts being used for anything other than collecting the initial dataset.
I think this dataset is one example for which we might want to use our more "advanced" yaml templates, which specify prompts such as "what is the {speed of sound} for a mixture of {component 1} and {component 2}". The documentation on how to specify this is currently not so great. I can help after the weekend with an example if you would like me to do so.
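Purely to illustrate the idea (this is not the project's actual yaml schema, and the column names are invented), such a template would presumably be filled per parsed row, roughly like this:

```python
# Toy illustration only; the real templates live in the project's yaml files.
row = {"component_1": "water", "component_2": "ethanol", "speed_of_sound": "1450 m/s"}

template = "What is the speed of sound for a mixture of {component_1} and {component_2}?"

prompt = template.format(**row)
completion = row["speed_of_sound"]
print(prompt, "->", completion)
```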
Sounds good to me -- I'm not in any rush with this, so it could wait until the dataset hackathon perhaps (unless you are already heavily committed on other stuff for that!). I should be able to attend for a couple of hours in the morning at least. I think there is a lot of data being left on the table with the current parsing approach -- happy to investigate customising the thermopyl parser usage further and getting access to more of the [listed properties](https://trc.nist.gov/ThermoML/) (see below). It may even be worth splitting into multiple datasets.
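As a rough sketch of the kind of per-file customisation I mean (assuming the fork keeps upstream thermopyl's `Parser(filename).parse()` interface, which returns a list of measurement dicts per file; the paths and the filtering step are illustrative only):

```python
import glob

import pandas as pd
from thermopyl import Parser  # assumes the fork keeps the upstream Parser API

frames = []
for xml_file in glob.glob("ThermoML/**/*.xml", recursive=True):
    records = Parser(xml_file).parse()  # list of measurement dicts for one file
    df = pd.DataFrame(records)
    # Extra per-file processing (unit handling, filtering to the properties
    # listed at https://trc.nist.gov/ThermoML/, etc.) would go here, rather
    # than cleaning the final concatenated CSV.
    frames.append(df)

full = pd.concat(frames, ignore_index=True)
full.to_csv("thermoml.csv", index=False)
```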
@ml-evs, we have revised the prompt template syntax in the contribution guide - do you want to give it a shot? Otherwise, if you're overcommitted, I can also look at it. Just let me know.
I am definitely over-committed right now but don't just want to shift the burden to you... if you find time to look at it yourself then just write here, otherwise I will add it around the bottom of my to-do list again!
This PR adds the ThermoML dataset as a flat CSV file. I'm not sure if this is the most appropriate way of doing it yet, and it's not entirely clear to me how to define the targets for this dataset!
The compressed archive is about 200 MB, around 4 GB of XML and JSON files when extracted. The data takes about an hour to parse on my machine (~10k files into around ~2.6m rows).
Here, we use @marocsfelt's `thermopyl` fork to extract the XML files into a dataframe. I imagine this will need some additional processing to be useful in this project, but it's not clear to me yet what that would involve (and it probably needs to be done at the transform step, per file, rather than by cleaning the final CSV).

Other matters arising:
Example rows for current state:
Closes #117