Auditing OneHotEncodingTransformer #178

Closed
sarahmish opened this issue Jul 13, 2021 · 14 comments · Fixed by #186

sarahmish commented Jul 13, 2021

We audit the time and memory performance of OneHotEncodingTransformer.

fitting

most of the time is consumed by

self.dummies = list(pd.get_dummies(data, dummy_na=self.dummy_na).columns)

CPU times: user 3.3 ms, sys: 159 µs, total: 3.46 ms
Wall time: 6.41 ms
peak memory: 150.72 MiB, increment: 0.00 MiB

transform

similarly most of the time is consumed by

dummies = pd.get_dummies(data, dummy_na=self.dummy_na)

CPU times: user 3.18 ms, sys: 897 µs, total: 4.07 ms
Wall time: 4.87 ms
peak memory: 185.29 MiB, increment: 0.00 MiB

reverse transform

Mapping the indices back to the original categories is what takes most of the time (though not much overall)

return pd.Series(indices).map(self.dummies.__getitem__)

CPU times: user 1.89 ms, sys: 0 ns, total: 1.89 ms
Wall time: 5.48 ms
peak memory: 185.05 MiB, increment: 0.00 MiB
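
For context, here is a minimal standalone sketch of that mapping, assuming the indices come from an argmax over the one-hot rows (the function name is hypothetical, not the transformer's actual method):

import numpy as np
import pandas as pd

def reverse_map_sketch(one_hot_array, dummies):
    # column index of the 1 in each row
    indices = np.argmax(one_hot_array, axis=1)
    # map each column index back to the original category
    return pd.Series(indices).map(dummies.__getitem__)

one_hot = np.array([[1, 0], [0, 1], [1, 0]])
print(reverse_map_sketch(one_hot, ['cat', 'dog']).tolist())  # ['cat', 'dog', 'cat']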

sarahmish commented Jul 13, 2021

suggestions to improve fit

Fitting time can be improved; the objective of fit is to (1) record the list of dummies and (2) check whether NaN is among them. Proposed change:

def fit(self, data):
    data = self._prepare_data(data)
    data[pd.isnull(data)] = np.nan
    self.dummy_na = pd.isnull(data).any()
    self.dummies = list(pd.unique(data))

The issue with this implementation is that the assignment data[pd.isnull(data)] = np.nan increases the time significantly.

In this case, the time it takes is:

CPU times: user 2.84 ms, sys: 0 ns, total: 2.84 ms
Wall time: 2.56 ms

@sarahmish

A workaround to remove the assignment:

def fit(self, data):
    data = self._prepare_data(data)
    self.dummy_na = pd.isnull(data).any()
    self.dummies = list(pd.unique(data[~pd.isnull(data)]))
    if self.dummy_na:
        self.dummies.append(np.nan)

In this case, the time it takes is:

CPU times: user 1.62 ms, sys: 0 ns, total: 1.62 ms
Wall time: 2.8 ms

@sarahmish

suggestions to improve transform

Some ideas to see whether we can use numpy to make the transformation faster than pandas.get_dummies:

def transform(self, data):
    data = self._prepare_data(data)
    data[pd.isnull(data)] = np.nan
    elements = np.vectorize(self.dummies.index)(data)
    array = np.eye(len(self.dummies))[elements].astype(int)
    for i, row in enumerate(array):
        if np.all(row == 0) and self.error_on_unknown:
            raise ValueError(f'The value {data[i]} was not seen during the fit stage.')

    return array

The problem with this implementation is that (1) it is not stable, since we cannot use dummies.index to find the index of NaN because nan != nan, and (2) even without NaN in the input, it ends up being slower than what we already have.

CPU times: user 6.02 ms, sys: 0 ns, total: 6.02 ms
Wall time: 8.12 ms


csala commented Jul 14, 2021

Very nice report @sarahmish !

A few suggestions I would make are:

  1. It may be worth using timeit instead of time when we go down to comparing the performance of specific lines or approaches (a rough sketch follows below).
  2. A potential improvement could come from capturing certain things in variables or even instance attributes. For example, nulls = pd.isnull(data) would allow you to skip a pd.isnull(data) call in the fit implementation, and self.num_dummies = len(self.dummies) inside fit would allow you to skip calling len(self.dummies) at every transform call. These are probably minor optimizations, but they add up anyway.
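
A minimal sketch of the timeit comparison from point 1, assuming standalone wrappers around the two fit approaches discussed above (the wrapper names and the sample column are made up for illustration):

import timeit

import numpy as np
import pandas as pd

# hypothetical sample column; the actual audit used the project's data generators
data = pd.Series(np.random.choice(['a', 'b', 'c', None], size=100))

def fit_current(data):
    # current approach: discover the categories through pd.get_dummies
    return list(pd.get_dummies(data, dummy_na=True).columns)

def fit_new(data):
    # proposed approach: pd.unique on the non-null values, then append NaN if needed
    null = pd.isnull(data)
    dummies = list(pd.unique(data[~null]))
    if null.any():
        dummies.append(np.nan)
    return dummies

for fit in (fit_current, fit_new):
    seconds = timeit.timeit(lambda: fit(data), number=1000)
    print(f'{fit.__name__}: {seconds / 1000 * 1e6:.1f} µs per call')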


csala commented Jul 14, 2021

Here is another suggestion: What about using np.broadcast_to?

This basically takes a 1d array and replicates it the indicated number of times to make a 2d matrix with that array as rows.
Using this, you can create 2 matrices of the same shape (given rows, number of categories):

  • 1 matrix that has the categorical values as rows
  • another matrix that has the given data as columns

Afterwards you do an element-wise comparison and get True where the category matches.

def get_dummies(data, categories):
    num_rows = len(data)
    num_categories = len(categories)
    dummies = np.broadcast_to(categories, (num_rows, num_categories))
    data = np.broadcast_to(data, (num_categories, num_rows)).T
    return (data == dummies).astype(int)
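
A quick sanity check of this helper on a toy input (the values below are made up for illustration):

import numpy as np

data = np.array(['a', 'b', 'a', 'c'])
categories = ['a', 'b', 'c']
print(get_dummies(data, categories))
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]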

The result is a function that is much faster than pd.get_dummies:

[plot: timing comparison showing the numpy get_dummies beating pd.get_dummies]

WRT the nulls, you could do the same thing as in the fit: separate all the rows that are NaNs and just set 1s in an additional column for them.


sarahmish commented Jul 14, 2021

@csala based on your suggestion I sketched out a new implementation for fit and transform; it looks something like this:

def fit(self, data):
    data = self._prepare_data(data)
    null = pd.isnull(data)
    self.dummy_na = null.any()
    self.dummies = list(pd.unique(data[~null]))
    if self.dummy_na:
        self.dummies.append(np.nan)

    self.num_dummies = len(self.dummies)

def transform(self, data):
    data = self._prepare_data(data)
    num_rows = len(data)
    
    dummies = np.broadcast_to(self.dummies, (num_rows, self.num_dummies))
    coded = np.broadcast_to(data, (self.num_dummies, num_rows)).T
    array = (coded == dummies).astype(int)

    if self.dummy_na:
        null = pd.isnull(data)
        num_nulls = sum(null) 
        null_code = np.zeros(self.num_dummies)
        null_code[-1] = 1
        array[null] = np.broadcast_to(null_code, (num_nulls, self.num_dummies))        

    for i, row in enumerate(array):
        if np.all(row == 0) and self.error_on_unknown:
            raise ValueError(f'The value {data[i]} was not seen during the fit stage.')

    return array

I made handling NaN a completely separate block to avoid the extra cost when there are no nulls. I also assume that NaN will always be encoded as [0, 0, ..., 1], since we append np.nan at the end of dummies.

I tested the transformers using timeit on multiple data generators of size 100

| Generator | Current Transformer (fit) | Current Transformer (transform) | New Transformer (fit) | New Transformer (transform) |
| --- | --- | --- | --- | --- |
| Random Categories | 577 µs | 1.21 ms | 328 µs | 575 µs |
| Random Categories with np.nan | 538 µs | 1.18 ms | 331 µs | 604 µs |
| Random Categories with None | 663 µs | 1.36 ms | 323 µs | 786 µs |
| Random Categories with [np.nan, None] | 681 µs | 1.35 ms | 331 µs | 808 µs |
| Unique Categories | 428 µs | 1.06 ms | 323 µs | 559 µs |

Here I can see that there is a ~43% speedup in the fit method and a ~50% speedup in transform. If I try something larger, testing size 10000, we get the following:

| Generator | Current Transformer (fit) | Current Transformer (transform) | New Transformer (fit) | New Transformer (transform) |
| --- | --- | --- | --- | --- |
| Random Categories | 1.33 ms | 50.9 ms | 1.05 ms | 51.5 ms |
| Random Categories with np.nan | 1.37 ms | 51.3 ms | 1.09 ms | 52.6 ms |
| Random Categories with None | 1.58 ms | 51.2 ms | 1.03 ms | 55.1 ms |
| Random Categories with [np.nan, None] | 1.63 ms | 51.8 ms | 1.03 ms | 56 ms |
| Unique Categories | 14.2 ms | 356 ms | 1.02 ms | 184 ms |

In this case, I definitely see that the fit method improved, but the transform of the new transformer only beats the old one on one of the generators (Unique Categories).

@sarahmish

I tested the get_dummies version above (using np.broadcast_to) on a larger input; in that case, pandas wins.

[plot: get_dummies vs. pd.get_dummies runtime on larger inputs]

As the data gets larger, the numpy implementation gets a lot slower and pandas becomes the better option.


sarahmish commented Jul 14, 2021

One thing I realized from the test above: when the input size is large, the search for all-zero rows is what takes a lot of time, which makes sense because we are iterating row by row.

can we go from

for i, row in enumerate(array):
    if np.all(row == 0) and self.error_on_unknown:
        raise ValueError(f'The value {data[i]} was not seen during the fit stage.')

to something like

unknown = array.sum(axis=1) == 0
if self.error_on_unknown and unknown.any():
    raise ValueError(f'Attempted to transform {list(data[unknown])} that were not seen during fit stage.')

This loop is the cause of the extremely long runtime seen in the test above for an input of size 10000.

@sarahmish

Stress testing the difference in transform between the pandas and numpy implementations, using the new error-raising snippet from the previous comment:

| Generator | pandas (size 100) | pandas (size 100000) | numpy (size 100) | numpy (size 100000) |
| --- | --- | --- | --- | --- |
| Random Categories | 1.24 ms | 11.8 ms | 114 µs | 23.1 ms |
| Random Categories with np.nan | 1.21 ms | 13.1 ms | 109 µs | 31.6 ms |
| Random Categories with None | 1.35 ms | 13.8 ms | 320 µs | 49.3 ms |
| Random Categories with [np.nan, None] | 1.35 ms | 14.4 ms | 323 µs | 50.2 ms |
| Unique Categories | 1.08 ms | - | 89.6 µs | - |

Notes

  • numpy is faster on smaller input
  • pandas is faster on large input
  • both approaches fail when the number of categories is large

Recommendation

  1. change fit to the new suggestion
  2. keep transform using pandas and change the error-raising part


csala commented Jul 15, 2021

Excellent report @sarahmish !

With regard to the recommendations, here's one thing to consider: the recursive nature of SDV relational models (HMA1 in particular) may provoke a lot of calls to transform and reverse_transform with just a few rows, rather than a single call with the entire table.

So I would suggest considering these two options:

  • Having both implementations in separate private methods and deciding which one to use based on the data size (a rough sketch of this option follows below).
  • Looping over batches of rows (batch size could be an optional parameter) instead of doing all the rows at once.
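
A rough, self-contained sketch of the first option; the threshold, the function names, and the reindex step in the pandas path are illustrative assumptions, not the library's actual code:

import numpy as np
import pandas as pd

# hypothetical crossover point; the size experiments later in this thread suggest ~3000 rows
SIZE_THRESHOLD = 3000

def transform_numpy(data, dummies):
    # broadcast-based comparison: faster on small inputs
    num_rows, num_dummies = len(data), len(dummies)
    coded = np.broadcast_to(data, (num_dummies, num_rows)).T
    return (coded == np.broadcast_to(dummies, (num_rows, num_dummies))).astype(int)

def transform_pandas(data, dummies):
    # pd.get_dummies path: faster on large inputs; reindex keeps one column per fitted category
    return pd.get_dummies(pd.Series(data)).reindex(columns=dummies, fill_value=0).astype(int).to_numpy()

def transform_dispatch(data, dummies):
    if len(data) < SIZE_THRESHOLD:
        return transform_numpy(data, dummies)
    return transform_pandas(data, dummies)

The batched option would look similar, for example looping over np.array_split(data, num_batches) and stacking the partial results.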


csala commented Jul 15, 2021

  • both approaches fail when the number of categories is large

I'm curious about this. Why do they fail? Memory Error?

@sarahmish

  • both approaches fail when the number of categories is large

I'm curious about this. Why do they fail? Memory Error?

Yeah, colab runs out of memory and the kernel dies.
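
For a rough sense of scale: the Unique Categories generator at size 100,000 implies 100,000 fitted categories, so the one-hot output is a 100,000 × 100,000 matrix, roughly 10^10 cells, which is about 10 GB even with a 1-byte dtype and around 80 GB with int64.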

@sarahmish

  • having both implementations in separate private methods and decide which one to use based on the data size.

Based on this suggestion, I experimented with different input sizes on the Random Categories with [np.nan, None] generator to figure out when we should use each one.

[plot: transform runtime of the pandas and numpy implementations across input sizes]

Starting at size=2800 the two approaches contend head to head; after size=3300 pandas becomes the clear winner.


sarahmish commented Jul 22, 2021

More analysis was done and based on it we have the following recommendation.

Some rules to follow:

  1. when the type of the data is numeric, array == array is a fast comparison.
  2. otherwise, map each element to a position and use that position as an encoding for broadcasting (a rough sketch follows below).
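
A minimal sketch of rule 2, assuming no nulls for brevity (the helper name is hypothetical):

import numpy as np
import pandas as pd

def one_hot_via_positions(data, dummies):
    # map every fitted category to its position in the list
    positions = {category: index for index, category in enumerate(dummies)}
    # unknown values become -1 so they never match any column
    coded = pd.Series(data).map(positions).fillna(-1).to_numpy()
    # numeric comparison broadcast to shape (num_rows, num_dummies)
    return (coded[:, None] == np.arange(len(dummies))).astype(int)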

The changes significantly improve the speed compared to the current implementation:

fit

def fit(self, data):

  • current: 975 µs per loop
  • improvement: 480 µs per loop

transform

def transform(self, data):

  • current: 6.87 ms per loop
  • improvement: 1.43 ms per loop

@fealho added this to the 0.5.1 milestone Aug 11, 2021