New data base checks #219

Open
issamali67 opened this issue Dec 11, 2020 · 6 comments
@issamali67

issamali67 commented Dec 11, 2020

The new database is finished. It contains a total of 23591 molecules (a few are single-atom molecules, which can be filtered out). We need to randomly check some of the new molecules (see attached other-molecules.txt).
other-molecules.txt

@issamali67
Author

Here is the current (legacy) BaM database.
legacy_other_molecules.txt

@ariel-phet

@jonathanolson @issamali67 @arouinfar There seem to be a few issues with the new molecule list. Overall I think the list of new molecules is correct, but the new list has filtered out some legitimate compounds from the original list.

This Excel file should provide an easy reference:
BAM new molecules.xlsx

  • The .txt files given above have been imported into Excel
  • The first section gives the newly generated list of molecules
  • The second section is the list of molecules in the published sim
  • In the second section, the molecules that are NOT duplicates between the old and updated lists appear first
  • The third section contains only the old molecules that do not appear in the new list
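The third section amounts to a set difference on PubChem IDs. A minimal sketch of that comparison, assuming each list has already been parsed into a CID-to-name mapping (the parsing step, the function name, and the placeholder names are hypothetical and do not reflect the actual layout of the attached .txt files):

```python
def missing_from_new(legacy, new):
    """Return legacy entries whose CID is absent from the new list, ordered by CID.

    Both arguments are dicts mapping an integer CID to a molecule name.
    """
    missing_cids = sorted(legacy.keys() - new.keys())
    return {cid: legacy[cid] for cid in missing_cids}


# Placeholder names for illustration only; real names come from the attached lists.
legacy = {7235: "name B", 6347: "name A", 962: "name C"}
new = {962: "name C"}

for cid, name in missing_from_new(legacy, new).items():
    print(cid, name)
```

Keying on the CID rather than the name avoids false mismatches when the two lists use different synonyms for the same compound.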

Here are some examples, going down the list of about 1000 molecules that do not seem to be in the new list (by PubChem ID):

  1. 6347 - appears to be legitimate
  2. 7235 - appears to be a legacy record, and should not be included
  3. 10038 - seems legitimate and should be included
  4. 11190 - not found under ID or name, should not be included
  5. 11535 - seems legitimate

My theory is that for some of the legitimate ones, like 11535, something like the following is occurring: searching 11535 brings up the legitimate record for the "Compound CID" but brings up a molecule we would not include for the "Substance SID".

So some of the filtering appears to be correct (rejecting legacy records), and some seems to be incorrect (rejecting legitimate molecules that should be included in the updated list).

Maybe @issamali67 can troubleshoot his scripts, or maybe it would be useful for QA to go through the first 100 or so of these as above and see if any other patterns emerge.

@ariel-phet ariel-phet assigned arouinfar and issamali67 and unassigned ariel-phet Jan 29, 2021
@arouinfar
Contributor

Thanks for creating the spreadsheet @ariel-phet! I spent ~15 minutes checking the first 50 rows in the "Molecules not in updated list from original list" section; see BAM.new.molecules-AR.xlsx. About 60% of the compounds appear to have been incorrectly eliminated. The remaining 40% were eliminated because the compound was a radical or the ID matched an irrelevant SID/PMID.

I found two borderline cases where the IDs matched CIDs of legitimate compounds. However, the name listed in the spreadsheet was not a synonym on the PubChem profile. I checked the spreadsheet for other instances of the chemical formula, but didn't find anything.

  • 138197 - Name in spreadsheet is 2-(ketomethylene)malononitrile
  • 138256 - Name in spreadsheet is 3-ketoacrylonitrile

I think a good first step would be for @issamali67 to troubleshoot.

@arouinfar arouinfar removed their assignment Feb 1, 2021
@issamali67
Author

I think you checked these molecules with a PubChem search, and the results can differ from what is in the SDF files, which are what the filtering uses. I checked the SDF files for the first 100 molecules appearing in the Excel file (which includes the ones @arouinfar looked at). These 100 fall into 3 categories:

(1) name not in SDF: a molecule whose information mostly exists in the SDF file but whose name does not. 52 molecules in the Excel file are like that. Since the current database includes molecule names, my program filters out molecules that have no name in the SDF file.
(2) not in SDF: a molecule that does not exist in the SDF file. I found 25 such molecules. You will most probably still find them with a PubChem search!
(3) filtered out: these are filtered out by my program, and I will investigate why. I found 23 molecules in this category.
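The three categories can be expressed as a small decision function. This is only a sketch under assumptions: it takes a CID that is already known to be missing from the new list, and it assumes the SDF files have been parsed into a dict of CID to record, where a record may carry a "name" key. The function name and record layout are hypothetical and are not taken from the actual filtering program.

```python
def categorize(cid, sdf_records):
    """Classify a CID that is missing from the new list.

    sdf_records: dict mapping CID -> record dict (assumed layout; "name" is optional).
    """
    record = sdf_records.get(cid)
    if record is None:
        return "not in SDF"
    if not record.get("name"):
        # The record exists but has no usable name, so the name filter drops it.
        return "name not in SDF"
    # Present and named, so some later filtering stage must have removed it.
    return "filtered out"


# Made-up CIDs and records, for illustration only.
sdf_records = {
    1: {"formula": "H2O", "name": "water"},
    2: {"formula": "CH4"},
}

for cid in (1, 2, 3):
    print(cid, categorize(cid, sdf_records))
```

Only the last category points at the filtering logic itself; the first two depend on what the SDF files contain.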

I will post an update on category 3 later.

@issamali67
Author

Found the problem. Among the molecules that are filtered out (highlighted in yellow in the attached Excel file):
(1) There are no buckets in the current BaM sim for 3 molecules. For example, the sim does not allow molecules of sulfur and oxygen or of boron and fluorine (see the column with the header "comments 2" in the attached Excel file). Maybe this was allowed in an older BaM version?
(2) The remaining filtered-out molecules: I filter in 2 stages. The result of the first stage is a "first-pass" list. All the molecules highlighted in yellow (minus the ones above) are present in the first-pass list but are removed in the second filtering stage. This is because, when a molecule has duplicates, my code keeps the one with the smallest CID (I order molecules by CID), and the smallest CID may be, for example, an isotope.
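To illustrate the duplicate problem in (2), here is one possible way to pick a representative per group of duplicates. It is a sketch, not necessarily the fix that will be applied. It assumes duplicates are grouped by formula and that each record carries a hypothetical "is_isotope" flag; both the grouping key and the flag are assumptions for illustration.

```python
from collections import defaultdict


def pick_representatives(records, key=lambda r: r["formula"]):
    """Keep one record per duplicate group.

    Prefers records not flagged as isotopes, then the smallest CID,
    instead of taking the smallest CID unconditionally.
    """
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    chosen = []
    for group in groups.values():
        # False sorts before True, so unflagged records win; ties go to the lower CID.
        best = min(group, key=lambda r: (r.get("is_isotope", False), r["cid"]))
        chosen.append(best)
    return sorted(chosen, key=lambda r: r["cid"])


# Made-up CIDs, for illustration only.
records = [
    {"cid": 10, "formula": "H2O", "is_isotope": True},
    {"cid": 20, "formula": "H2O"},
    {"cid": 5, "formula": "CO2"},
]
print([r["cid"] for r in pick_representatives(records)])
```

With a plain "smallest CID wins" rule, the H2O group would keep CID 10, the flagged record, and a later stage that rejects isotopes would then drop the molecule entirely.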

I will correct this and regenerate the database.
BAM.new.molecules-AR-IH.xlsx

@issamali67
Author

I fixed the problem with the filtering code. I still get around 23K molecules in the end. Many of these molecules were created after the first database came out in 2011. Attached is the Excel file (containing names, formulas, and CIDs only) for checking (I already checked some random ones and they look fine). Most of these molecules have 3D info (I am currently downloading it). A few are 2D only.
new_filtered_2d.txt
