Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]: split by atom presence/absence #167

Open
kevingreenman opened this issue Jan 6, 2024 · 3 comments
Open

[FEATURE]: split by atom presence/absence #167

kevingreenman opened this issue Jan 6, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@kevingreenman
Copy link

Related to #145, but I think different enough to be its own request. It could be useful in some cases to split molecules based on the presence/absence of an atom or atoms (e.g. put everything that has a F in the test set, put everything that has a halogen in the test set, etc.).

In this case, the user should probably be required to input the atom or atoms that they want excluded into the test set (whereas with a scaffold split and maybe with the functional group split suggested in #145, no user input would be necessary).

Alternatively, #145 could be done where the user inputs a set of functional groups that they want in the test set. In this case, this PR would just be a special case of that where the "functional group" is just a single atom.

@kevingreenman kevingreenman added the enhancement New feature or request label Jan 6, 2024
@JacksonBurns
Copy link
Owner

Hi @kevingreenman thanks for the suggestion! I like this idea, and it seems pretty straightforward to implement (i.e. just scan a smiles string for a character, don't even need a RDKit molecule). How do you envision this sampler interacting with the split size arguments (test size, validation size, training size). Would they be ignored?

@kspieks
Copy link
Collaborator

kspieks commented Jan 8, 2024

My opinion is that converting to an RDKit molecule is likely unnecessary for this splitter. I think scanning the SMILES string should be sufficient.

Regarding the split size arguments, I think there's many possible ideas here. I'll list some just to start brainstorming. Additional thoughts are always welcome :)

One idea could be to specify one atom type, identify all molecules with that atom, and then place half in val and half in test. Or perhaps the val and test size arguments could still be used to split these up with different sizes?

But what should the desired behavior be if multiple atom types are specified? Would one/some be restricted to val while the others be restricted to test? Or should all atom types be stratified across val and test?

@kevingreenman
Copy link
Author

When I've done this type of splitting manually before, I haven't used any split sizes as inputs. But I could see a variety of ways this could be implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants