New data base checks #219
Here is the current (legacy) BaM database.
@jonathanolson @issamali67 @arouinfar There seem to be a few issues with the new molecule list. Overall I think the list of new molecules is correct, but the new list has filtered out some legitimate compounds from the original list. This Excel file should provide an easy reference.
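For anyone repeating the comparison, finding the molecules dropped from the original list amounts to a set difference on PubChem IDs. The sketch below is only an illustration: the function name and the sample IDs are made up here, and this is not necessarily how the spreadsheet was produced.

```python
def missing_from_new(legacy_cids, new_cids):
    """Return the legacy CIDs that are absent from the new list, sorted ascending."""
    return sorted(set(legacy_cids) - set(new_cids))


# Hypothetical IDs, for illustration only (not taken from the real lists).
legacy = [962, 11535, 280, 222]
new = [962, 280, 5000]
print(missing_from_new(legacy, new))  # [222, 11535]
```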
Giving some examples, going down the list of about 1000 molecules that do not seem to be in the new list (by PubChem ID)
My theory for some of the legitimate ones, like 11535, is that something like the following is occurring: searching 11535 brings up the legitimate record for the "Compound CID" but brings up a molecule we would not include for the "Substance SID". So some of the filtering appears to be correct (rejecting legacy records), and some seems to be incorrect (rejecting legitimate molecules that should be included in the updated list). Maybe @issamali67 can troubleshoot his scripts, or maybe it would be useful for QA to go through the first 100 or so of these as above and see if any other patterns emerge.
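To make the CID/SID ambiguity concrete: PubChem keeps compounds and substances in separate namespaces, so the same number can point at two unrelated records. A minimal sketch that builds the two PUG REST record URLs for one number, so they can be opened side by side. The helper name `record_urls` is made up for this example, and no request is sent.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def record_urls(identifier):
    """Build the compound (CID) and substance (SID) record URLs for one number."""
    return {
        "compound": f"{PUG_REST}/compound/cid/{identifier}/JSON",
        "substance": f"{PUG_REST}/substance/sid/{identifier}/JSON",
    }


urls = record_urls(11535)
print(urls["compound"])
print(urls["substance"])
```

If a filtering script looks an ID up in the substance namespace when the legacy database stored a CID, it would reject a legitimate molecule for the wrong reason.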
Thanks for creating the spreadsheet @ariel-phet! I spent ~15 minutes checking the first 50 rows in the "Molecules not in updated list from original list" section, see BAM.new.molecules-AR.xlsx. About 60% of the compounds appear to have been incorrectly eliminated. The remaining 40% were eliminated because the compound was a radical or the ID matched an irrelevant SID/PMID. I found two borderline cases where the IDs matched CIDs of legitimate compounds, but the name listed in the spreadsheet was not a synonym on the PubChem profile. I checked the spreadsheet for other instances of the chemical formula, but didn't find anything.
I think a good first step would be for @issamali67 to troubleshoot.
I think you guys checked these molecules with PubChem search, and the results there can differ from what is in the SDF files, which are used for filtering. I checked the SDF files for the first 100 molecules appearing in the Excel file (including the ones @arouinfar looked at). These 100 divide into 3 categories: (1) name not in SDF: a molecule whose information mostly exists in the SDF but whose name does not. 52 molecules in the Excel file are like that. Since the current database includes molecular names, my program filters out molecules that have no name in the SDF. Will update later on category 3.
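A rough sketch of the "name not in SDF" check described above, for anyone who wants to reproduce it. This is not the actual filtering program: the tag names are an assumption (`PUBCHEM_COMPOUND_CID` and `PUBCHEM_IUPAC_NAME` are data items that appear in PubChem SDF downloads, but the real script may test a different field), and the sample records are placeholders.

```python
import re


def parse_sdf_fields(record):
    """Collect the '> <TAG>' data items of one SDF record into a dict."""
    fields = {}
    tag = None
    for line in record.splitlines():
        match = re.match(r">\s*<([^>]+)>", line)
        if match:
            tag = match.group(1)
            fields[tag] = []
        elif tag is not None:
            if line.strip() == "":
                tag = None  # a blank line ends the current data item
            else:
                fields[tag].append(line.strip())
    return {key: "\n".join(value) for key, value in fields.items()}


def split_by_name(sdf_text, name_tag="PUBCHEM_IUPAC_NAME",
                  id_tag="PUBCHEM_COMPOUND_CID"):
    """Return (named_ids, unnamed_ids) for the records in an SDF string."""
    named, unnamed = [], []
    for record in sdf_text.split("$$$$"):  # '$$$$' terminates each record
        if not record.strip():
            continue
        fields = parse_sdf_fields(record)
        cid = fields.get(id_tag, "?")
        (named if fields.get(name_tag) else unnamed).append(cid)
    return named, unnamed


# Placeholder records: IDs and name are invented, mol blocks are omitted.
sample = """record one
  mol block omitted

M  END
> <PUBCHEM_COMPOUND_CID>
1

$$$$
record two
  mol block omitted

M  END
> <PUBCHEM_COMPOUND_CID>
2

> <PUBCHEM_IUPAC_NAME>
placeholder name

$$$$
"""

named, unnamed = split_by_name(sample)
print("named:", named)      # named: ['2']
print("unnamed:", unnamed)  # unnamed: ['1']
```

Listing the unnamed IDs separately, instead of silently dropping them, makes it easier to spot legitimate molecules that only lack a name in the SDF.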
Found the problem. From the molecules that are filtered out (highlighted in yellow in the attached Excel file): Will correct this and generate another database.
I fixed the problem with the filtering code. I still get around 23K molecules in the end. Many of these molecules were created after the first database came out in 2011. Attached is the Excel file (containing names, formulas, and CIDs only) for checking (I already checked some random ones and they look fine). Most of these molecules have 3D info (currently downloading this info). A few are only 2D.
The new database is finished. A total of 23591 molecules (a few are single-atom molecules, which can be filtered out). We need to randomly check some of the new molecules (see attached other-molecules.txt).
other-molecules.txt
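One possible way to do the two follow-ups mentioned above (dropping single-atom entries and drawing a random sample for manual review), sketched under assumptions: each molecule is taken to be a dict with a `formula` key holding a simple molecular formula such as `H2O`, and the helper names and sample entries are invented for illustration.

```python
import random
import re


def atom_count(formula):
    """Count atoms in a simple formula like 'C2H6O' (no brackets or hydrates)."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return sum(int(count) if count else 1 for _, count in pairs)


def drop_single_atoms(molecules):
    """Keep only the entries whose formula has more than one atom."""
    return [m for m in molecules if atom_count(m["formula"]) > 1]


def sample_for_review(molecules, k, seed=0):
    """Pick up to k entries at random; a fixed seed keeps the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(molecules, min(k, len(molecules)))


# Illustrative entries only, not rows from the attached file.
molecules = [
    {"name": "water", "formula": "H2O"},
    {"name": "helium", "formula": "He"},
    {"name": "ethanol", "formula": "C2H6O"},
]

kept = drop_single_atoms(molecules)
print([m["name"] for m in kept])  # ['water', 'ethanol']
print(len(sample_for_review(kept, 5)))  # 2
```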