Trennen is German for "separate".
- Use machine learning to find out if we can do a good job predicting the angle of optical activity for a given enantiomer.
  - Determine factors which affect optical activity.
  - [ ]

- Use machine learning to predict the EE% (enantiomeric excess) of a reaction.
  - Determine factors of enantioselectivity.
  - [ ]
 
 
Input: solvent, reactants, and catalysts as positions in 3d space
Output: EE%
- First, I downloaded the QM9 dataset.
 - This includes about 130K different organic molecules with their xyz coordinates, SMILES, and InChI.
 - That's cool.
 - But some of these organic molecules in their SMILES form did not have any stereochemistry.
 - We need molecules with stereochemistry because all planar molecules (2D molecules) can be flipped in 3D space to undo a reflection.
 - Generally speaking, an n-dimensional figure is achiral in n+1 dimensions (I postulate).
 - However, being achiral in n+1 dimensions does not by itself make a figure chiral in n dimensions.
 - We need stereochemistry for optical activity, so I simply moved all files with stereochemistry into their own directory as follows: `find . -type f -exec grep -F '@' {} \; -exec mv -t files_with_stereochemistry/ {} +`.
 - This is because the SMILES format uses the `@` symbol to denote stereochemistry.
 - OK.
 - That's cool.
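
For strings that are already loaded in memory, the same `@` check can be sketched in Python. This is only an illustration of the filter described above, not the command that was actually run; the function names and example strings are made up. Note that it looks only at `@` (tetrahedral centers) and, like the `find` command, ignores the `/` and `\` markers that SMILES uses for double-bond geometry.

```python
def has_stereo_marker(smiles: str) -> bool:
    """Return True if the SMILES string contains a tetrahedral '@' marker."""
    return "@" in smiles


def split_by_stereo(smiles_list):
    """Partition SMILES strings into (with_marker, without_marker)."""
    with_marker = [s for s in smiles_list if has_stereo_marker(s)]
    without_marker = [s for s in smiles_list if not has_stereo_marker(s)]
    return with_marker, without_marker


# Illustrative strings, not taken from QM9.
examples = ["C[C@H](N)C(=O)O", "CCO", "C[C@@H](O)CC"]
kept, dropped = split_by_stereo(examples)
print(kept)
print(dropped)
```
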
 - But we don't know if all of these molecules have chiral centers.
 - Remember, our goal is to filter all these molecules to just be enantiomers.
 - To find the chiral centers and filter them into a new directory, we can use RDKit, a Python cheminformatics library.
 - So I wrote a simple Python script titled "find_files_with_chiral_centers.py".
 - After executing it (took about 20 minutes), we have a new directory with ~97K molecules which contain chiral centers.
 - OK.
 - That's cool.
 - But we need only molecules which are chiral.
 - As a sidenote, I do know that diastereomers are sometimes optically active, but for the purpose of this project we are considering only enantiomers.
 - At the time of doing this project, I was learning group theory and how I might use it to determine whether a molecule is chiral or not.
 - I had checked many places online but I couldn't really find a definitive explanation.
 - However, I had received a reply from ChirBase allowing me to sample their database which contained about 13K chiral compounds (the full database contains over 300K compounds).
 - So I ran with the idea and began by exporting the database as an Excel worksheet in the SMILES format (thankfully the isomeric SMILES was included).
 - I then removed the excess junk and created a single txt file with all the SMILES strings, named "CHIRBASE_SEPARATION.txt" (not included), in a directory called chirbase_chiral_molecules.
 - However, this list contained duplicate SMILES because the same compound appeared with different information depending on the researcher.
 - Therefore, I wrote a simple script named "remove_duplicate_entries.py" to remove all the duplicates.
 - After running it, the new file was named "CHIRBASE_SEPARATION_UPDATED.txt".
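
A minimal sketch of what such a deduplication step could look like (this is not the actual "remove_duplicate_entries.py"; the function name is hypothetical). It compares strings exactly, so two different SMILES spellings of the same molecule would both survive unless they are canonicalized first.

```python
def remove_duplicates(lines):
    """Drop repeated entries while keeping the first-seen order."""
    cleaned = (line.strip() for line in lines)
    # dict.fromkeys preserves insertion order, so the first copy of each entry wins.
    return list(dict.fromkeys(s for s in cleaned if s))


# Illustrative input; blank lines are skipped.
raw = ["CCO\n", "C[C@H](O)F\n", "CCO\n", "\n"]
unique = remove_duplicates(raw)
print(unique)
```
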
 - Now, the final step in setting up the data was to retrieve the optical activity for each of the chiral compounds.
 - To do this, I broke it up into two steps.
 - First, we would determine whether the compound exists on Chemsp***r and retrieve its link.
 - Since Chemsp***r redirects its links immediately to the compound if it is found, a simple script named "get_chemsp***r_links.py" was written to automatically determine the redirect link and save it to a file.
 - This script took about 12 hours to execute since there was a 1 second delay included in the script to prevent overloading the chemsp***r servers as well as to prevent the chemsp***r overlords banning my IP :)
 - OK.
 - That's cool.
 - We have a list of all the links to chiral compounds with their respective chiral molecules.
 - By the way, the actual links file was cleaned up to remove the "no redirects" and links which did not get sent to an actual molecule name.
 - In total, we have 6K chiral compounds which have a valid chemsp***r link.
 - For those interested, the commands in vim were, in order:
   - `:%s/^no redirect\n//g`
   - `:%s/^.*@.*$\n//g`
   - `:%s/^.*C\/.*\n//g`
   - `:%s/^.*C(.*\n//g`
   - `:%s/^.*C=.*\n//g`
   - `:%s/^.*=O.*\n//g`
   - `:%s/?rid.*$//g`
   - `:%s/b'//g`
 - WAIT
 - I just shot myself in the foot.
 - I executed all the find and replace commands and removed all the "no redirects", but now I don't know which SMILES string each link belongs to.
 - RIP.
 - I guess I have to run this again to determine exact smiles structures . . .
 - Next time, I should just leave a blank line or a line with a specific character (such as #) to specify that it is a placeholder for an invalid link.
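
The placeholder idea could look roughly like this. It is a sketch under assumptions: the `#` marker, the function names, and the example URL are all hypothetical, and `lookup` stands in for whatever actually resolves a SMILES string to a link.

```python
PLACEHOLDER = "#"  # hypothetical marker meaning "no valid link for this line"


def align_links(smiles_list, lookup):
    """Return exactly one output line per SMILES so both files keep the same length.

    `lookup` is any callable mapping a SMILES string to a URL, or None if not found.
    """
    return [lookup(s) or PLACEHOLDER for s in smiles_list]


def pair_valid(smiles_list, link_lines):
    """Zip the two aligned lists back together, skipping placeholder lines."""
    return [(s, u) for s, u in zip(smiles_list, link_lines) if u != PLACEHOLDER]


# Stand-in for the real lookup: a dict with one made-up entry.
known = {"C[C@H](O)F": "https://example.org/compound/1"}
smiles = ["CCO", "C[C@H](O)F"]
links = align_links(smiles, known.get)
pairs = pair_valid(smiles, links)
```

Because the placeholder keeps line N of the links file matched to line N of the SMILES file, invalid entries can be dropped later without losing the correspondence.
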
 - However, we can run this again in conjunction with part 2 which is to actually retrieve the optical rotation direction.
 - This can simply be done by extracting the page title or synonyms of the respective chemsp***r link, since the direction is included in the molecule's name.
 - To do this, get_chemspider_link.py was completely rewritten.
 - The end result should be that CHIRBASE_SEPARATION_LINKS.txt stays in sync with CHIRBASE_SEPARATION_UPDATED.txt, such that the SMILES and links correspond if a valid molecule exists.
 - Additionally, the CHIRBASE_SEPARATION_DIRECTION.txt file should include a list of arrays with the respective SMILES, URL, and optical direction.
 - As I am working on this file, I just realized that we don't even need the chirbase database. We just need a large set of molecules and simply check if it contains the (-) or (+) indicator in the title to determine its chirality.
 - Since this script checks the redirect link as well as retrieving the link, the script took about [blank] hours to execute.
 - After all of this, we finally had a list of chiral molecules with their optical rotation direction.
 - Depending on how much data we are able to extract from these molecules, we have two options.
 - If we have a lot of data (relatively), we will begin writing our machine learning model.
 - If we do not have a lot of data, we should ideally find a larger dataset with more organic molecules and run all of the above steps again.
 - In either case, we have one more necessary step for our data science part.
 - We need to artificially generate the stereoisomer of each molecule if it is not present.
 - And after considering this, I believe it would make the most sense to generate these molecules before running the above script and check them on chemspider.
 - This is because I don't see a trivial way of generating the enantiomers of molecules with multiple chiral centers.
 - Furthermore, it appears that some data in the "chirbase" database does not contain only chiral molecules.
 - For example, dichloromethane appears in the data . . .
 - Also, it appears that the chirbase database is too small.
 - So . . . we are going to transition our work to the
 - OK.
 - So first I moved all the files in files_With_chiral_centers into a subdirectory named files.
 - Then I copied the files/ directory into the files_with_optical_rotation directory.
 - Then I began working on the script in the directory.
 - It seems like it would be easier to first create a giant file with the list of SMILES as well as their stereoisomers to be searched in chemsp***r.
 - OK.
 - LET's DO THIS
 - BRUH
 - ok
 - Just like undo the past 50 lines or something.
 - I was reading something online from Jun 2000!
 - And they said you could just simply reflect the mol file across the origin to get the enantiomer.
 - So yeah.
 - From that we can determine the chirality of a molecule: if the reflected molecule cannot be superimposed on the original, the molecule is chiral.
 - So I basically wrote two functions and made a pull request to RDKit.
 - So yeah lol.
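
To make the reflection trick concrete, here is a small pure-Python sketch. It is not the code from the RDKit pull request; the function names and coordinates are invented for illustration. Negating every coordinate (inversion through the origin) is a mirror reflection followed by a 180° rotation, so it produces the enantiomer. The signed volume of the tetrahedron spanned by four substituents flips sign under that operation, which is one way to see that the handedness changed.

```python
def invert_through_origin(atoms):
    """Map every atom (symbol, x, y, z) to (symbol, -x, -y, -z)."""
    return [(sym, -x, -y, -z) for sym, x, y, z in atoms]


def signed_volume(a, b, c, d):
    """Six times the signed volume of tetrahedron a-b-c-d; the sign encodes handedness."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    # Determinant of the 3x3 matrix with rows u, v, w.
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


# Made-up positions of four different substituents around a center at the origin.
atoms = [("F", 1.0, 0.0, 0.0), ("Cl", 0.0, 1.0, 0.0),
         ("Br", 0.0, 0.0, 1.0), ("H", -1.0, -1.0, -1.0)]
mirror = invert_through_origin(atoms)

original_volume = signed_volume(*[a[1:] for a in atoms])
mirrored_volume = signed_volume(*[a[1:] for a in mirror])
```
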
 - We're just going to use the QM9 dataset (isomeric smiles format) and filter out only chiral molecules.
 - Then we'll generate the enantiomers and write a smart function that, when an enantiomer is missing on chemsp***r, assigns it the opposite direction of the other enantiomer.
 - OK, just generated all the enantiomers of molecules with chirality in the QM9 dataset.
 - This means we have a big list of chiral molecules (enantiomers)!!!
 - Took about 60 minutes to execute (find_files_with_chirality.py).
 - UPDATE: Seems like there are going to be a lot of enantiomers!! At 27%, we already had 40K enantiomers! More Data = Better chances at beating MIT
 - Now, we just get the relevant optical rotation value from chemsp***r.
 - Fortunately, we already wrote a script to do that!
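
The "smart function" for missing enantiomers mentioned above could be sketched like this. It relies on the fact that two enantiomers rotate plane-polarized light in opposite directions under the same conditions; the function name, data layout, and example strings are assumptions for illustration only.

```python
OPPOSITE = {"+": "-", "-": "+"}


def fill_missing_direction(pairs, known):
    """Infer the rotation direction of an enantiomer from its mirror image.

    pairs: list of (smiles, mirror_smiles) tuples.
    known: dict mapping SMILES to '+' or '-' for entries that were found.
    Returns a new dict; pairs where neither direction is known are left out.
    """
    filled = dict(known)
    for a, b in pairs:
        if a in filled and b not in filled:
            filled[b] = OPPOSITE[filled[a]]
        elif b in filled and a not in filled:
            filled[a] = OPPOSITE[filled[b]]
    return filled


# Illustrative data, not real measurements.
pairs = [("C[C@H](O)F", "C[C@@H](O)F"), ("C[C@H](N)Cl", "C[C@@H](N)Cl")]
known = {"C[C@H](O)F": "+"}
directions = fill_missing_direction(pairs, known)
```
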
 - Ok so after not working on this for two weeks, here is my progress: All is vanity. Everything in the useless/data/ folder is vanity. Waste of time. Completely.
 - At least I learned a lot though. Even got a PR on rdkit. Anyways. . .
 - So basically chemsp***er is pretty bad since (1) it's slow and (2) IT GAVE LIKE 100 OPTICAL ROTATION VALUES AFTER RUNNING IT FOR OVER 9000 compounds.
 - That's like a 1% extraction rate and we'll never be able to compete with the 70K molecules ChiRO used.
 - We're going to use PubChem.
 - And after searching, I came across this article: https://www.ncbi.nlm.nih.gov/Class/PubChem/essentials/limits.html
 - Basically, you can obtain all compounds in the PubChem database by their chirality and . . . now we have a dataset of ~17 million chiral compounds (YOO).
 - By choosing the export type as the synonyms, we can simply search for molecules with the (+) or (-) synonym and extract the CID number. Then, we use the CID number to obtain the isomeric SMILES/mol file.
 - Let's go.
 - GG.
 - Ok so apparently there was a download fail and it only downloaded 4 million compounds out of the possible 18 million compounds.
 - But.... good news
 - On our computer now (4million.txt), we have approximately 15 thousand compounds labeled with their (-) or (+) indicator, without artificially generating the enantiomer.
 - Not bad.
 - We'll retry the download to see if we can get all 18 million compounds.
 - Alright, so I made a video essentially explaining what I did.
 - The download was incredibly slow and a terrible process.
 - So I used the ESearch API to repeat the search I did on the NCBI site and got all the CIDs in the CIDs.txt file.
 - Then I used PUG REST to retrieve all the synonyms of the CIDs.
 - Since doing individual ones was slow, I wrote a script, pubchem.py, which essentially sends a POST request with a bunch of CIDs separated by commas.
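
A rough sketch of the batching idea, not the actual pubchem.py. It only builds the requests and never sends them; the batch size, helper names, and example CIDs are hypothetical. The endpoint is PUG REST's synonyms operation, with the comma-separated CID list placed in the POST body.

```python
import urllib.request

PUG_SYNONYMS_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/synonyms/JSON"
)


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_synonym_request(cids):
    """Build (but do not send) one POST request covering a batch of CIDs."""
    body = ("cid=" + ",".join(str(c) for c in cids)).encode("ascii")
    return urllib.request.Request(
        PUG_SYNONYMS_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder CIDs and a tiny batch size, chosen only for illustration.
cids = [1, 2, 3, 4, 5]
requests_to_send = [build_synonym_request(batch) for batch in chunked(cids, 2)]
```

Each request could then be passed to `urllib.request.urlopen`, ideally with a short pause between batches to stay polite to the server.
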
 - Then, I retrieved the smiles and placed them in a file.
 - After some ReGeX magic, we got to the files ilovesmiles.txt and ilovejson.txt.
 - Note that the data in ilovedata.txt is NOT all chiral since it also includes compounds with lines such as (CH+).
 - The generate_enantiomers.py script sorts these compounds and creates a new file with enantiomers and only chiral compounds.
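
For the (+)/(-) filtering, a stricter pattern avoids the (CH+) kind of false positive. This is a sketch and not the regex that was actually used; the function name and sample synonyms are illustrative. It accepts only a lone sign inside parentheses and returns nothing when the synonyms disagree or carry no sign.

```python
import re

# Matches exactly "(+)" or "(-)"; "(CH+)" and "(+/-)" do not match.
SIGN_RE = re.compile(r"\(([+-])\)")


def rotation_sign(synonyms):
    """Return '+' or '-' if the synonyms agree on one sign, otherwise None."""
    signs = {m.group(1) for name in synonyms for m in SIGN_RE.finditer(name)}
    return signs.pop() if len(signs) == 1 else None


plus = rotation_sign(["(+)-Limonene", "D-Limonene"])
minus = rotation_sign(["(1R)-(-)-Menthol"])
none_found = rotation_sign(["(CH+)", "(+/-)-Menthol"])
conflicting = rotation_sign(["(+)-X", "(-)-X"])
```
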