Description
I've run into an issue featurizing a dataframe of ~1000 band structures with multiprocessing. I've confirmed that it isn't specific to any featurizer by creating a dummy featurizer that does nothing. The problem seems to be that too much data is pickled and sent to a worker process in a single message: the traceback shows the payload length overflowing a signed 32-bit integer (roughly 2 GiB). I think this is exacerbated by my use of uniform band structures, some of which are likely quite large.
If I run the featurization in serial mode I don't see the error.
Example code
from matminer.data_retrieval.retrieve_MP import MPDataRetrieval
from matminer.featurizers.base import BaseFeaturizer
class DummyBSFeaturizer(BaseFeaturizer):
    def featurize(self, bs):
        return 1

    def feature_labels(self):
        return ["dummy_features"]
mpdr = MPDataRetrieval()
criteria = {
    "band_structure": {"$exists": True},
    "elements": {"$nin": ["H"]},
    "band_gap": {"$gt": 0, "$lte": 2},
    "e_above_hull": {"$lte": 0.20},
    "nsites": {"$gte": 1, "$lte": 80},
    "icsd_ids": {"$exists": True, "$ne": []},
    "has_bandstructure": True
}
# get a dataframe of ~5000 band structures. Note, this takes a LONG time.
df = mpdr.get_dataframe(criteria=criteria, properties=['bandstructure_uniform'])
dbsf = DummyBSFeaturizer()
df = dbsf.featurize_dataframe(df[2000:3000], "bandstructure_uniform")
Error traceback
error Traceback (most recent call last)
<ipython-input-55-06ffb3880a7c> in <module>
1 dbsf = DummyBSFeaturizer()
----> 2 df = dbsf.featurize_dataframe(df[2000:3000], "bandstructure_uniform")
~/dev/src/matminer/matminer/featurizers/base.py in featurize_dataframe(self, df, col_id, ignore_errors, return_errors, inplace, multiindex, pbar)
338 ignore_errors=ignore_errors,
339 return_errors=return_errors,
--> 340 pbar=pbar)
341
342 # Make sure the dataframe can handle multiindices
~/dev/src/matminer/matminer/featurizers/base.py in featurize_many(self, entries, ignore_errors, return_errors, pbar)
465 return_errors=return_errors,
466 ignore_errors=ignore_errors)
--> 467 return p.map(func, entries, chunksize=self.chunksize)
468
469 def featurize_wrapper(self, x, return_errors=False, ignore_errors=False):
~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
286 in a list that is returned.
287 '''
--> 288 return self._map_async(func, iterable, mapstar, chunksize).get()
289
290 def starmap(self, func, iterable, chunksize=None):
~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
668 return self._value
669 else:
--> 670 raise self._value
671
672 def _set(self, i, obj):
~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
448 break
449 try:
--> 450 put(task)
451 except Exception as e:
452 job, idx = task[:2]
~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/connection.py in send(self, obj)
204 self._check_closed()
205 self._check_writable()
--> 206 self._send_bytes(_ForkingPickler.dumps(obj))
207
208 def recv_bytes(self, maxlength=None):
~/miniconda3/envs/common_env3/lib/python3.6/multiprocessing/connection.py in _send_bytes(self, buf)
391 n = len(buf)
392 # For wire compatibility with 3.2 and lower
--> 393 header = struct.pack("!i", n)
394 if n > 16384:
395 # The payload is large so Nagle's algorithm won't be triggered
error: 'i' format requires -2147483648 <= number <= 2147483647
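For reference, the last frame packs the length of the pickled payload into a signed 32-bit header, so any single message larger than 2**31 - 1 bytes fails at that point. A minimal sketch of just that step, independent of matminer (the helper name `fits_in_header` is mine, for illustration only):

```python
import struct

INT32_MAX = 2**31 - 1  # largest length the "!i" header can encode


def fits_in_header(n_bytes):
    """Return True if a payload of n_bytes can be described by a '!i' header."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        # Same error type as in the traceback above.
        return False


print(fits_in_header(INT32_MAX))      # True
print(fits_in_header(INT32_MAX + 1))  # False
```

So the failure depends only on the size of one pickled task, not on what the featurizer does with it.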
Possible fixes
Short of manually remapping the band structure numpy data to shared memory, I can't think of an easy fix for this.
As this error isn't raised when featurization is done in serial mode, perhaps a short-term solution would be to set the default value of n_jobs to 1 for all band structure featurizers?
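A possible middle ground, which I haven't tested against matminer itself, would be to keep parallelism but shrink the chunks so that no single task approaches the limit. The traceback shows `featurize_many` passing `self.chunksize` to `p.map`, so each task carries that many entries. The sketch below uses a hypothetical helper, `largest_safe_chunksize` (not part of matminer), that pickles each entry once and uses the largest one as a worst case. It ignores the overhead of pickling the function and the chunk container, which is why the assumed default budget is 1 GiB rather than the full 2 GiB.

```python
import pickle


def largest_safe_chunksize(entries, budget=2**30):
    """Hypothetical helper: estimate a chunk length whose pickled size fits in budget bytes.

    The largest single pickled entry is treated as the worst case, so any
    chunk of the returned length is estimated to stay under the budget.
    """
    sizes = [len(pickle.dumps(e, protocol=pickle.HIGHEST_PROTOCOL)) for e in entries]
    if not sizes:
        raise ValueError("entries is empty")
    worst = max(sizes)
    if worst > budget:
        raise ValueError("a single entry exceeds the budget")
    # Never return more than the number of entries available.
    return min(len(sizes), budget // worst)


# Toy usage: ten ~1 kB payloads with a 5 kB budget.
toy_entries = [b"x" * 1000 for _ in range(10)]
print(largest_safe_chunksize(toy_entries, budget=5000))
```

Note that this pickles every entry once, which is itself slow and memory-hungry for large band structures, so estimating from a sample of the column may be more practical.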