
RAM issue in MolToDescriptorPipelineElement when standardizer not None #23

Open
@JochenSiegWork

Description

I tried to process a data set of 1.4M molecules with a small Pipeline looking like this:

pipeline = Pipeline(
    [
        ("smi2mol", SmilesToMol()),
        # MolToNetCharge inherits from MolToDescriptorPipelineElement
        ("net_charge_element", MolToNetCharge()),
    ]
)

This leads to RAM issues because Molpipeline tries to hold the RDKit data structures for all 1.4M molecules in RAM simultaneously. The cause is that Molpipeline splits the pipeline into syncing and non-syncing parts for instance-based processing: an element that requires fitting acts as a syncing point, so all molecules produced upstream must exist at once before it can run.
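To illustrate the effect, here is a minimal, library-independent sketch. It does not use Molpipeline's actual internals; every function name below is hypothetical and the "molecules" are plain dicts. A chain of non-syncing steps can be evaluated lazily, one item at a time, whereas a step that must be fitted first needs every upstream result before it can proceed:

```python
from typing import Iterable, Iterator, List


def to_mol(smiles: Iterable[str]) -> Iterator[dict]:
    """Hypothetical stand-in for SmilesToMol: lazily yields one object per input."""
    for smi in smiles:
        yield {"smiles": smi}


def net_charge(mol: dict) -> int:
    """Toy descriptor (not real chemistry): counts '+' minus '-' in the string."""
    return mol["smiles"].count("+") - mol["smiles"].count("-")


def run_streaming(smiles: Iterable[str]) -> List[int]:
    """Non-syncing: each molecule is consumed and dropped before the next is built."""
    return [net_charge(mol) for mol in to_mol(smiles)]


def run_syncing(smiles: Iterable[str]) -> List[int]:
    """Syncing: a step that needs fitting forces all molecules to exist at once."""
    mols = list(to_mol(smiles))  # every molecule is held in memory here
    # A real fit step would inspect `mols` here before transforming them.
    return [net_charge(mol) for mol in mols]
```

Both functions return the same values; the difference is only in how many intermediate objects are alive at the same time, which is what matters for 1.4M RDKit molecules.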

In the constructor of MolToDescriptorPipelineElement, _requires_fitting is set to True whenever the standardizer is not None:

if self._standardizer is not None:
    self._requires_fitting = True

The RAM issues can be avoided by disabling the standardizer:

pipeline = Pipeline(
    [
        ("smi2mol", SmilesToMol()),
        ("net_charge_element", MolToNetCharge(standardizer=None)),
    ]
)

It would be better to implement the standardization in a way that does not require all molecules to be held in RAM at once.
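One possible direction, sketched below under assumptions: if the standardizer only needs a per-feature mean and standard deviation (as a scikit-learn StandardScaler does), these statistics can be accumulated in a single pass over the descriptor values, so the molecules themselves never need to be kept. This is not how Molpipeline is implemented; streaming_mean_std is a hypothetical helper that uses Welford's online algorithm.

```python
import math
from typing import Iterable, Tuple


def streaming_mean_std(values: Iterable[float]) -> Tuple[float, float]:
    """Welford's online algorithm: one pass over the values, O(1) memory."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count == 0:
        raise ValueError("cannot compute statistics of an empty stream")
    std = math.sqrt(m2 / count)  # population standard deviation
    return mean, std
```

The input can be a generator that computes one descriptor per molecule, so only a single molecule is alive at any time during fitting. scikit-learn's StandardScaler also offers partial_fit for batch-wise fitting, which might serve the same purpose.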
