# Auditing OneHotEncodingTransformer #178
**Suggestions to improve `fit`**

Fitting time can be improved. The objective of `fit` is to collect the unique categories in the data:

```python
def fit(self, data):
    data = self._prepare_data(data)
    data[pd.isnull(data)] = np.nan
    self.dummy_na = pd.isnull(data).any()
    self.dummies = list(pd.unique(data))
```

The issue with this implementation is the assignment `data[pd.isnull(data)] = np.nan`, which dominates the time `fit` takes.
---
A workaround is to remove the assignment altogether:

```python
def fit(self, data):
    data = self._prepare_data(data)
    self.dummy_na = pd.isnull(data).any()
    self.dummies = list(pd.unique(data[~pd.isnull(data)]))
    if self.dummy_na:
        self.dummies.append(np.nan)
```

In this case, the time `fit` takes drops considerably.
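To see the difference, a hypothetical micro-benchmark (not from the thread; `fit_with_assignment` and `fit_without_assignment` are illustrative stand-ins for the two versions above) could look like this:

```python
# Hypothetical micro-benchmark comparing the two fit strategies above on
# synthetic data; function names and sizes are illustrative only.
import numpy as np
import pandas as pd
from timeit import timeit

data = pd.Series(np.random.choice(['a', 'b', 'c', None], size=1_000_000))

def fit_with_assignment(series):
    values = series.to_numpy(copy=True)
    values[pd.isnull(values)] = np.nan          # the costly assignment
    dummy_na = pd.isnull(values).any()
    return list(pd.unique(values)), dummy_na

def fit_without_assignment(series):
    values = series.to_numpy()
    null = pd.isnull(values)
    dummy_na = null.any()
    dummies = list(pd.unique(values[~null]))
    if dummy_na:
        dummies.append(np.nan)
    return dummies, dummy_na

print(timeit(lambda: fit_with_assignment(data), number=10))
print(timeit(lambda: fit_without_assignment(data), number=10))
```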
---
**Suggestions to improve `transform`**

Some ideas, to see if we can use numpy indexing to build the one-hot matrix:

```python
def transform(self, data):
    data = self._prepare_data(data)
    data[pd.isnull(data)] = np.nan
    elements = np.vectorize(self.dummies.index)(data)
    array = np.eye(len(self.dummies))[elements].astype(int)
    for i, row in enumerate(array):
        if np.all(row == 0) and self.error_on_unknown:
            raise ValueError(f'The value {data[i]} was not seen during the fit stage.')
    return array
```

The problem with this implementation is that (1) it is not stable: we cannot use `list.index` safely when nulls are involved, since a NaN coming from the data may not match the `np.nan` stored in `self.dummies` (see the snippet below).
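A toy snippet (not from the thread) showing the instability directly — `list.index` only matches a NaN by object identity, so a NaN that arrives as a different object raises:

```python
import numpy as np

dummies = ['a', 'b', np.nan]

# Same object: list.index short-circuits on identity, so this works.
print(dummies.index(np.nan))   # 2

# Different NaN object: nan != nan, so no equality match is found
# and list.index raises ValueError: nan is not in list.
dummies.index(float('nan'))
```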
---
Very nice report @sarahmish! A few suggestions I would make are:
---
Here is another suggestion: what about using `np.broadcast_to`? This basically takes a 1d array and replicates it the indicated number of times to make a 2d matrix with that array as rows. Afterwards you do an element-wise comparison and get `True`s where the category matches:

```python
def get_dummies(data, categories):
    num_rows = len(data)
    num_categories = len(categories)
    dummies = np.broadcast_to(categories, (num_rows, num_categories))
    data = np.broadcast_to(data, (num_categories, num_rows)).T
    return (data == dummies).astype(int)
```

The result is a function that is much faster than the `np.vectorize` approach. WRT the nulls, you could do the same thing as in the current implementation.
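For concreteness, here is how `get_dummies` behaves on a toy input (expected output shown in comments):

```python
import numpy as np

# get_dummies as defined above.
def get_dummies(data, categories):
    num_rows = len(data)
    num_categories = len(categories)
    dummies = np.broadcast_to(categories, (num_rows, num_categories))
    data = np.broadcast_to(data, (num_categories, num_rows)).T
    return (data == dummies).astype(int)

categories = np.array(['a', 'b', 'c'])
data = np.array(['b', 'a', 'c', 'b'])

print(get_dummies(data, categories))
# [[0 1 0]
#  [1 0 0]
#  [0 0 1]
#  [0 1 0]]
```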
---

@csala based on your suggestion I sketched out a new implementation for `fit` and `transform`:

```python
def fit(self, data):
    data = self._prepare_data(data)
    null = pd.isnull(data)
    self.dummy_na = null.any()
    self.dummies = list(pd.unique(data[~null]))
    if self.dummy_na:
        self.dummies.append(np.nan)

    self.num_dummies = len(self.dummies)


def transform(self, data):
    data = self._prepare_data(data)
    num_rows = len(data)

    # Broadcast categories and data against each other and compare.
    dummies = np.broadcast_to(self.dummies, (num_rows, self.num_dummies))
    coded = np.broadcast_to(data, (self.num_dummies, num_rows)).T
    array = (coded == dummies).astype(int)

    # NaN never compares equal, so null rows come out all-zero;
    # overwrite them with the dedicated null code.
    if self.dummy_na:
        null = pd.isnull(data)
        num_nulls = null.sum()
        null_code = np.zeros(self.num_dummies, dtype=int)
        null_code[-1] = 1
        array[null] = np.broadcast_to(null_code, (num_nulls, self.num_dummies))

    for i, row in enumerate(array):
        if np.all(row == 0) and self.error_on_unknown:
            raise ValueError(f'The value {data[i]} was not seen during the fit stage.')

    return array
```

I made dealing with nulls conditional on `self.dummy_na`, so the extra work only happens when nulls were seen during `fit`. I tested the transformers, and here I can see a ~43% speedup; in this case, I definitely see that the new implementation is faster.
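To try the sketch end-to-end, a minimal hypothetical harness (not the real RDT class; `_prepare_data` is assumed to just return a 1d object array, and the unknown-value check is omitted for brevity):

```python
# Minimal, hypothetical harness wrapping the two methods sketched above.
import numpy as np
import pandas as pd

class OneHotSketch:

    def _prepare_data(self, data):
        # The real transformer does more validation; we just flatten to 1d.
        return np.asarray(data, dtype=object)

    def fit(self, data):
        data = self._prepare_data(data)
        null = pd.isnull(data)
        self.dummy_na = null.any()
        self.dummies = list(pd.unique(data[~null]))
        if self.dummy_na:
            self.dummies.append(np.nan)
        self.num_dummies = len(self.dummies)

    def transform(self, data):
        data = self._prepare_data(data)
        num_rows = len(data)
        dummies = np.broadcast_to(self.dummies, (num_rows, self.num_dummies))
        coded = np.broadcast_to(data, (self.num_dummies, num_rows)).T
        array = (coded == dummies).astype(int)
        if self.dummy_na:
            null = pd.isnull(data)
            null_code = np.zeros(self.num_dummies, dtype=int)
            null_code[-1] = 1
            array[null] = null_code
        return array

ohe = OneHotSketch()
ohe.fit(['a', 'b', None, 'a'])
print(ohe.transform(['b', None, 'a']))
# [[0 1 0]
#  [0 0 1]
#  [1 0 0]]
```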
---

I tested this version of the transformer with larger inputs: as the data gets larger, the numpy implementation gets a lot slower and pandas becomes a better option.
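A hedged sketch of that comparison (synthetic data and illustrative sizes; `numpy_ohe` is a stand-in for the broadcasting version, with `pd.get_dummies` as the pandas-side candidate):

```python
# Hypothetical numpy-vs-pandas timing sketch; not the benchmark from the thread.
import numpy as np
import pandas as pd
from timeit import timeit

data = pd.Series(np.random.choice(list('abcdefghij'), size=1_000_000))
values = data.to_numpy()
categories = list(pd.unique(data))

def numpy_ohe(values, cats):
    num_rows, num_cats = len(values), len(cats)
    dummies = np.broadcast_to(cats, (num_rows, num_cats))
    coded = np.broadcast_to(values, (num_cats, num_rows)).T
    return (coded == dummies).astype(int)

print(timeit(lambda: numpy_ohe(values, categories), number=3))
print(timeit(lambda: pd.get_dummies(data), number=3))
```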
---

One thing I realized from the test above: when the input is large, the search for all-zero rows is what takes a lot of the time, which makes sense because we are iterating row by row. Can we go from the current check (RDT/rdt/transformers/categorical.py, lines 248 to 250 in 05b4036) to something like:

```python
unknown = array.sum(axis=1) == 0
if self.error_on_unknown and unknown.any():
    raise ValueError(f'Attempted to transform {list(data[unknown])} that were not seen during fit stage.')
```

This loop is the cause of the extremely long runtime seen in the test above for large inputs.
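As a quick sanity check of the vectorized version (toy values, not from the thread):

```python
import numpy as np

# Row 1 matched no category during transform, so it is all zeros.
array = np.array([[1, 0, 0],
                  [0, 0, 0],
                  [0, 1, 0]])

unknown = array.sum(axis=1) == 0
print(unknown)        # [False  True False]
print(unknown.any())  # True -> would trigger the ValueError
```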
---

**Stress testing the difference in the implementations**

**Notes**

**Recommendation**

---
Excellent report @sarahmish! With regards to the recommendations, here's one thing to consider: the recursive nature of SDV relational models (HMA1 in particular) may provoke a lot of calls to these methods. So I would suggest to consider these two options:
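For scale, a hypothetical micro-benchmark (not from the thread) of many small `fit`-style calls versus one large one over the same total number of rows; `fit_once` is an illustrative stand-in for the category-collection step:

```python
# Hypothetical illustration of per-call overhead dominating on small inputs.
import pandas as pd
from timeit import timeit

small = pd.Series(['a', 'b', 'c'] * 10)        # 30 rows
big = pd.Series(['a', 'b', 'c'] * 100_000)     # 300,000 rows

def fit_once(series):
    null = pd.isnull(series)
    return list(pd.unique(series[~null]))

print(timeit(lambda: fit_once(big), number=1))         # one large call
print(timeit(lambda: fit_once(small), number=10_000))  # many tiny calls
```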
---
I'm curious about this. Why do they fail? A MemoryError?

---
Yeah, Colab runs out of memory and the kernel dies.

---
Based on this suggestion, I experimented with some different input sizes on the Random Categories data.

---
More analysis was done and, based on it, we have the following recommendation. Some rules to follow:

The changes significantly increase the speed of the current implementation:

- `fit`: RDT/rdt/transformers/categorical.py, line 295 in ca639dc
- `transform`: RDT/rdt/transformers/categorical.py, line 308 in ca639dc

---
We audit the time and memory performance of `OneHotEncodingTransformer`.

**fitting**

Most of the time is consumed by RDT/rdt/transformers/categorical.py, line 233 in efddacf.

**transform**

Similarly, most of the time is consumed by RDT/rdt/transformers/categorical.py, line 246 in efddacf.

**reverse transform**

Mapping the index back to the original map is what takes most of the time (but not that much): RDT/rdt/transformers/categorical.py, line 268 in efddacf.
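A hedged sketch of how such an audit can be reproduced with `cProfile` (the import path follows the file references above; data shape, sizes, and default construction of the transformer are assumptions):

```python
# Hypothetical reproduction of the audit; everything here is illustrative.
import cProfile

import numpy as np
import pandas as pd
from rdt.transformers.categorical import OneHotEncodingTransformer

data = pd.Series(np.random.choice(['a', 'b', 'c', None], size=100_000))

ohe = OneHotEncodingTransformer()
cProfile.run('ohe.fit(data)', sort='cumtime')

transformed = ohe.transform(data)
cProfile.run('ohe.transform(data)', sort='cumtime')
cProfile.run('ohe.reverse_transform(transformed)', sort='cumtime')
```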