Multiprocessing error when featurizing many band structure objects #417

@utf

Description

I've run into an issue trying to featurize a dataframe containing ~1000 band structures when using multiprocessing. I've confirmed that this is not an issue related to any specific featurizer by creating a dummy featurizer that doesn't do anything. The problem seems to be that we are trying to push too much data through pickle (or some other part of the multiprocessing machinery) in one go. I think the issue is exacerbated because I'm using uniform band structures, some of which are likely quite large.

If I run the featurization in serial mode I don't see the error.
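
For context, the limit can be demonstrated on its own. This is just a standalone sketch of the length header that multiprocessing.connection writes before each pickled task (on the Python 3.6 shown in the traceback below):

import struct

# multiprocessing.connection.Connection._send_bytes (Python 3.6) prefixes each
# pickled payload with a signed 32-bit length header, so a single message
# cannot exceed 2**31 - 1 bytes (~2.1 GB).
too_big = 2**31  # one byte past the limit
try:
    struct.pack("!i", too_big)
except struct.error as exc:
    print(exc)  # 'i' format requires -2147483648 <= number <= 2147483647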

Example code

from matminer.data_retrieval.retrieve_MP import MPDataRetrieval
from matminer.featurizers.base import BaseFeaturizer

class DummyBSFeaturizer(BaseFeaturizer):

    def featurize(self, bs):
        # return a list to match feature_labels()
        return [1]

    def feature_labels(self):
        return ["dummy_features"]

    # required abstract methods on BaseFeaturizer
    def citations(self):
        return []

    def implementors(self):
        return []

mpdr = MPDataRetrieval()

criteria = {
    "band_structure": {"$exists": True},
    "elements": {"$nin": ["H"]},
    "band_gap": {"$gt": 0, "$lte": 2},
    "e_above_hull": {"$lte": 0.20},
    "nsites": {"$gte": 1, "$lte": 80},
    "icsd_ids": {"$exists": True, "$ne": []},
    "has_bandstructure": True
}

# get a dataframe of ~5000 band structures. Note, this takes a LONG time.
df = mpdr.get_dataframe(criteria=criteria, properties=['bandstructure_uniform'])

dbsf = DummyBSFeaturizer()
df = dbsf.featurize_dataframe(df[2000:3000], "bandstructure_uniform")
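
To check how close the data is to that limit, the pickled size of a slice of the dataframe can be measured directly. This is only a rough diagnostic sketch; the 500-row slice is arbitrary:

import pickle

# Rough diagnostic: how many bytes would multiprocessing have to pickle for a
# chunk of band structures? Anything approaching 2**31 - 1 bytes (~2.1 GB)
# overflows the 32-bit length header used by multiprocessing's Connection.
chunk = df["bandstructure_uniform"].iloc[:500].tolist()
payload = pickle.dumps(chunk, protocol=pickle.HIGHEST_PROTOCOL)
print(f"pickled size of 500 band structures: {len(payload) / 1e9:.2f} GB")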

Error traceback

error                                     Traceback (most recent call last)
<ipython-input-55-06ffb3880a7c> in <module>
      1 dbsf = DummyBSFeaturizer()
----> 2 df = dbsf.featurize_dataframe(df[2000:3000], "bandstructure_uniform")

~/dev/src/matminer/matminer/featurizers/base.py in featurize_dataframe(self, df, col_id, ignore_errors, return_errors, inplace, multiindex, pbar)
    338                                        ignore_errors=ignore_errors,
    339                                        return_errors=return_errors,
--> 340                                        pbar=pbar)
    341 
    342         # Make sure the dataframe can handle multiindices

~/dev/src/matminer/matminer/featurizers/base.py in featurize_many(self, entries, ignore_errors, return_errors, pbar)
    465                                return_errors=return_errors,
    466                                ignore_errors=ignore_errors)
--> 467                 return p.map(func, entries, chunksize=self.chunksize)
    468 
    469     def featurize_wrapper(self, x, return_errors=False, ignore_errors=False):

~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    286         in a list that is returned.
    287         '''
--> 288         return self._map_async(func, iterable, mapstar, chunksize).get()
    289 
    290     def starmap(self, func, iterable, chunksize=None):

~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    668             return self._value
    669         else:
--> 670             raise self._value
    671 
    672     def _set(self, i, obj):

~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    448                         break
    449                     try:
--> 450                         put(task)
    451                     except Exception as e:
    452                         job, idx = task[:2]

~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/connection.py in send(self, obj)
    204         self._check_closed()
    205         self._check_writable()
--> 206         self._send_bytes(_ForkingPickler.dumps(obj))
    207 
    208     def recv_bytes(self, maxlength=None):

~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/connection.py in _send_bytes(self, buf)
    391         n = len(buf)
    392         # For wire compatibility with 3.2 and lower
--> 393         header = struct.pack("!i", n)
    394         if n > 16384:
    395             # The payload is large so Nagle's algorithm won't be triggered

error: 'i' format requires -2147483648 <= number <= 2147483647

Possible fixes

Short of manually remapping the band structure numpy data to shared memory, I can't think of an easy fix for this.

As this error isn't raised when featurization is done in serial mode, perhaps a short-term solution would be to set the default value of n_jobs to 1 for all band structure featurizers? A sketch of that workaround follows below.
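
For reference, the serial workaround I have in mind looks like this (a sketch, assuming BaseFeaturizer's set_n_jobs is the right knob; changing the featurizer's default n_jobs would have the same effect):

# Run the featurization in serial so nothing has to be pickled across
# process boundaries.
dbsf = DummyBSFeaturizer()
dbsf.set_n_jobs(1)
df = dbsf.featurize_dataframe(df[2000:3000], "bandstructure_uniform")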
