Molstruct is a lightweight Python CLI tool that converts chemical molecule data from Comma-Separated Values (CSV) files to structured data formats: JSON-LD, RDFa, and Microdata. Molstruct has many customization options that you can, but don't have to, use. Python 3.2+ is supported and no dependencies are required. Sounds good so far? What would you say to a really tiny Molstruct Docker container? Just try Molstruct!
Structured data is additional data placed on websites. It is not visible to ordinary internet users but can be easily processed by machines. There are 3 formats that we can use to save structured data - JSON-LD, RDFa, and Microdata. Molstruct supports them all and uses the MolecularEntity profile.
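To make the idea of machine-readable markup concrete, here is a minimal Python sketch of what a JSON-LD description of one molecule could look like when embedded in a page. The `https://schema.org` context and the helper names `molecule_jsonld` and `as_script_tag` are assumptions made for illustration; the exact markup Molstruct emits may differ.

```python
import json


def molecule_jsonld(identifier, name, inchikey):
    """Build a JSON-LD description of one molecule as a plain dict."""
    return {
        "@context": "https://schema.org",  # assumed context, for illustration only
        "@type": "MolecularEntity",
        "identifier": identifier,
        "name": name,
        "inChIKey": inchikey,
    }


def as_script_tag(data):
    """Wrap JSON-LD in the script element commonly used to embed it in HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    water = molecule_jsonld("7732-18-5", "Water", "XLYOFNOQVPJJNP-UHFFFAOYSA-N")
    print(as_script_tag(water))
```

A browser does not render the content of such a script element, which is why this data stays invisible to ordinary visitors while remaining easy for crawlers to parse.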
There are many ways to get a CSV file with molecule data. The easiest is to download one from a chemical database, e.g. DrugBank. You can also create the CSV file yourself.
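If you create the CSV file yourself, a short script is often enough. The sketch below writes a header with a few of the default column names listed in this README, followed by two example molecules. The function name `molecules_to_csv` and the script name used afterwards are hypothetical.

```python
import csv
import io

# A subset of the default column names that Molstruct expects.
COLUMNS = ["identifier", "name", "inChIKey", "smiles", "molecularFormula"]

# Illustrative rows; replace them with your own molecules.
MOLECULES = [
    {"identifier": "7732-18-5", "name": "Water",
     "inChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
     "smiles": "O", "molecularFormula": "H2O"},
    {"identifier": "64-17-5", "name": "Ethanol",
     "inChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
     "smiles": "CCO", "molecularFormula": "C2H6O"},
]


def molecules_to_csv(rows, columns=COLUMNS):
    """Return CSV text: one header row, then one line per molecule."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the CSV so it can be redirected to a file.
    print(molecules_to_csv(MOLECULES), end="")
```

Saved as, say, make_csv.py, the script can be run with `python make_csv.py > molecules.csv`, and the resulting file can then be passed to Molstruct as the file argument.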
Use Molstruct in 3 easy steps. In this example, we will use the DrugBank open dataset. You need Python 3.2+ and pip installed.
- Open a terminal and install Molstruct
You can install Molstruct from PyPI:
    pip install molstruct
Molstruct is also available as a Docker image. In most cases, installing Molstruct from PyPI or using Docker should be sufficient and convenient, but you may want to run Molstruct from sources or build a Docker image yourself.
- Download the DrugBank open dataset in CSV format and unzip the downloaded archive.
- Molstruct has a predefined preset for this dataset. You just need to select the output format and enter the path to the CSV file. Assuming the drugbank vocabulary.csv file is in the current directory and the output format you're interested in is RDFa, the command will be as follows:
    molstruct -p drugbank-open -f rdfa "drugbank vocabulary.csv" > drugbank_cc0_rdfa.html
That's all. Now you have the RDFa file ready in the current directory. You can try other output formats and options as described below. You can also use Molstruct to convert other data in CSV format.
If you have Docker installed, you can use a tiny Molstruct image from Docker Hub.
Because the tool runs inside the container, you have to mount the local directory that contains your input file. The default working directory of the image is /app. You need to mount your local directory inside it (e.g. /app/input):
    docker run --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly lszeremeta/molstruct:latest
In this case, the local directory /home/user/input has been mounted under /app/input.
You can also simply mount the current working directory using the $(pwd) command substitution:
    docker run --rm --name molstruct-app --mount type=bind,source="$(pwd)",target=/app/input,readonly lszeremeta/molstruct:latest
usage: molstruct [-h] [--version] -f {jsonldhtml,jsonld,rdfa,microdata} [-i IDENTIFIER]
[-n NAME] [-ink INCHIKEY] [-in INCHI] [-sm SMILES] [-u URL]
[-iu IUPACNAME] [-mf MOLECULARFORMULA] [-w MOLECULARWEIGHT]
[-mw MONOISOTOPICMOLECULARWEIGHT] [-d DESCRIPTION]
[-dd DISAMBIGUATINGDESCRIPTION] [-img IMAGE] [-an ALTERNATENAME]
[-sa SAMEAS] [-p {drugbank-open} | -c] [-s {iri,uuid,bnode}] [-b BASE]
[-vd VALUE_DELIMITER] [-l LIMIT]
file
Supported MolecularEntity properties that correspond to default CSV column names: identifier, name, inChIKey, inChI, smiles, url, iupacName, molecularFormula, molecularWeight, monoisotopicMolecularWeight, description, disambiguatingDescription, image, alternateName and sameAs. If your CSV file uses different column names, you can specify them (see the arguments for changing the default column names below). You can also use a preset with the appropriate settings for your dataset.
-h, --help
    show help message and exit
--version
    show program version and exit
-f {jsonldhtml,jsonld,rdfa,microdata}, --format {jsonldhtml,jsonld,rdfa,microdata}
    output format
file
    CSV file path with molecule data to convert
Remember to use the appropriate file path when using the Docker image. Suppose you mounted your local directory /home/user/input under /app/input and the path to the CSV file you want to use in Molstruct is /home/user/input/file.csv. In this case, enter the path /app/input/file.csv or input/file.csv as the file argument value.
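The path translation described above can be sketched in Python as follows. The function names are hypothetical; the defaults /app/input and /app are the mount target and working directory mentioned in this README.

```python
from pathlib import PurePosixPath


def to_container_path(host_file, host_dir, mount_target="/app/input"):
    """Translate a host file path into the path seen inside the container.

    Raises ValueError if host_file is not located under host_dir.
    """
    relative = PurePosixPath(host_file).relative_to(PurePosixPath(host_dir))
    return str(PurePosixPath(mount_target) / relative)


def relative_to_workdir(container_path, workdir="/app"):
    """Return the container path relative to the image's working directory."""
    return str(PurePosixPath(container_path).relative_to(workdir))


if __name__ == "__main__":
    inside = to_container_path("/home/user/input/file.csv", "/home/user/input")
    print(inside)
    print(relative_to_workdir(inside))
```

Either of the two printed paths can be used as the file argument, since the relative form is resolved against the working directory of the image.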
Arguments for changing the default column names
-i IDENTIFIER, --identifier IDENTIFIER
    identifier column name ('identifier' by default), Text
-n NAME, --name NAME
    name column name ('name' by default), Text
-ink INCHIKEY, --inChIKey INCHIKEY
    inChIKey column name ('inChIKey' by default), Text
-in INCHI, --inChI INCHI
    inChI column name ('inChI' by default), Text
-sm SMILES, --smiles SMILES
    smiles column name ('smiles' by default), Text
-u URL, --url URL
    url column name ('url' by default), URL
-iu IUPACNAME, --iupacName IUPACNAME
    iupacName column name ('iupacName' by default), Text
-mf MOLECULARFORMULA, --molecularFormula MOLECULARFORMULA
    molecularFormula column name ('molecularFormula' by default), Text
-w MOLECULARWEIGHT, --molecularWeight MOLECULARWEIGHT
    molecularWeight column name ('molecularWeight' by default), Mass e.g. 0.01 mg
-mw MONOISOTOPICMOLECULARWEIGHT, --monoisotopicMolecularWeight MONOISOTOPICMOLECULARWEIGHT
    monoisotopicMolecularWeight column name ('monoisotopicMolecularWeight' by default), Mass e.g. 0.01 mg
-d DESCRIPTION, --description DESCRIPTION
    description column name ('description' by default), Text
-dd DISAMBIGUATINGDESCRIPTION, --disambiguatingDescription DISAMBIGUATINGDESCRIPTION
    disambiguatingDescription column name ('disambiguatingDescription' by default), Text
-img IMAGE, --image IMAGE
    image column name ('image' by default), URL
-an ALTERNATENAME, --alternateName ALTERNATENAME
    alternateName column name ('alternateName' by default), Text
-sa SAMEAS, --sameAs SAMEAS
    sameAs column name ('sameAs' by default), URL
-p {drugbank-open}, --preset {drugbank-open}
    apply presets for individual CSV sources to avoid setting individual options manually ('drugbank-open')
-c, --columns
    use only columns with renamed names; not available when using a preset
-s {iri,uuid,bnode}, --subject {iri,uuid,bnode}
    molecule subject type ('iri' by default)
-b BASE, --base BASE
    molecule subject base for 'iri' subject type ('http://example.com/molecule#entity' by default)
-vd VALUE_DELIMITER, --value-delimiter VALUE_DELIMITER
    value delimiter (' | ' by default)
-l LIMIT, --limit LIMIT
    maximum number of results (unlimited by default)
Available options may vary depending on the version. To display all available options with their descriptions use molstruct -h.
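The value delimiter is what allows a single CSV cell to hold several values, for example several alternate names. The following Python sketch shows the general idea of splitting such a cell on the default ' | ' delimiter. It is only an illustration: the function `split_values` is hypothetical, and the handling of surrounding whitespace and empty values is an assumption, not a description of Molstruct's internals.

```python
def split_values(cell, delimiter=" | "):
    """Split a multi-valued CSV cell into a list of non-empty values.

    Assumption: surrounding whitespace is stripped and empty parts are dropped.
    """
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


if __name__ == "__main__":
    # Default delimiter, as used by the --value-delimiter option.
    print(split_values("Aspirin | Acetylsalicylic acid"))
    # A custom delimiter, as one might pass with -vd ";".
    print(split_values("Aspirin;ASA", delimiter=";"))
```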
To make your work easier, Molstruct has built-in preset support. With a preset, you do not have to set everything manually: you just select the appropriate preset and it's ready. The presets are flexible. If you want to change, for example, the column names selected for a preset, you can do so. At the moment you can use the DrugBank open preset. There are plans to add more in the future. Any suggestions are welcome!
Settings for the DrugBank open dataset CSV file:
- --value-delimiter is set to ' | '
- --identifier is set to 'CAS'
- --name is set to 'Common name'
- --inChIKey is set to 'Standard InChI Key'
- --alternateName is set to 'Synonyms'
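One way to picture what the preset does is as a mapping from the dataset's column names to the default property names. The Python sketch below applies that mapping to a CSV header. The names `DRUGBANK_OPEN_COLUMNS` and `rename_header` are hypothetical, the sample row is illustrative, and this is not how Molstruct itself is implemented.

```python
import csv
import io

# Column names from the preset settings above, mapped to default property names.
DRUGBANK_OPEN_COLUMNS = {
    "CAS": "identifier",
    "Common name": "name",
    "Standard InChI Key": "inChIKey",
    "Synonyms": "alternateName",
}


def rename_header(csv_text, mapping=DRUGBANK_OPEN_COLUMNS):
    """Return CSV text whose header uses the mapped names.

    Columns that are not in the mapping keep their original names.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    rows[0] = [mapping.get(column, column) for column in rows[0]]
    output = io.StringIO()
    csv.writer(output, lineterminator="\n").writerows(rows)
    return output.getvalue()


if __name__ == "__main__":
    sample = (
        "CAS,Common name,Standard InChI Key,Synonyms\n"
        "50-78-2,Aspirin,BSYNRYMUTXBXSQ-UHFFFAOYSA-N,Acetylsalicylic acid | ASA\n"
    )
    print(rename_header(sample), end="")
```

In practice the preset makes this step unnecessary, because Molstruct is told which column holds which property instead of the file being rewritten.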
    molstruct -f jsonldhtml data.csv
Returns simple HTML with added JSON-LD. Assumes that the column names in the CSV file are the default ones.
    molstruct -f microdata -mf "formula" data.csv
Returns simple HTML with added Microdata. Assumes that the column names in the CSV file are the default ones, except that the molecularFormula column is named formula.
    molstruct -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "drugbank vocabulary.csv"
Returns simple HTML with added Microdata. When generating the output, only the selected columns are taken into account. A limit of 50 molecules has been specified.
    molstruct -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "drugbank vocabulary.csv" > output.html
Does the same as the example above but saves the results to output.html.
    docker run --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly lszeremeta/molstruct:latest -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "input/drugbank vocabulary.csv" > output.html
Does the same as the example above (run from the pre-built Docker image).
Returns simple HTML with added Microdata and redirects the output to the molecules.html file. Run from the pre-built Docker image.
Would you like to improve this project? Great! We are waiting for your help and suggestions. If you are new to open source contributions, read How to Contribute to Open Source.
Distributed under MIT License.
