Auditing OneHotEncodingTransformer #178

Closed
sarahmish opened this issue Jul 13, 2021 · 14 comments · Fixed by #186

sarahmish commented Jul 13, 2021

We audit the time and memory performance of OneHotEncodingTransformer.

fitting

most of the time is consumed by

self.dummies = list(pd.get_dummies(data, dummy_na=self.dummy_na).columns)

CPU times: user 3.3 ms, sys: 159 µs, total: 3.46 ms
Wall time: 6.41 ms
peak memory: 150.72 MiB, increment: 0.00 MiB

transform

similarly most of the time is consumed by

dummies = pd.get_dummies(data, dummy_na=self.dummy_na)

CPU times: user 3.18 ms, sys: 897 µs, total: 4.07 ms
Wall time: 4.87 ms
peak memory: 185.29 MiB, increment: 0.00 MiB

reverse transform

Mapping the indices back to the original categories is what takes most of the time (though not much overall)

return pd.Series(indices).map(self.dummies.__getitem__)

CPU times: user 1.89 ms, sys: 0 ns, total: 1.89 ms
Wall time: 5.48 ms
peak memory: 185.05 MiB, increment: 0.00 MiB
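
For context, here is a minimal standalone sketch of that mapping, assuming the indices come from an argmax over the one-hot rows (the function name is hypothetical, not the transformer's actual method):

import numpy as np
import pandas as pd

def reverse_map_sketch(one_hot_array, dummies):
    # column index of the 1 in each row
    indices = np.argmax(one_hot_array, axis=1)
    # map each column index back to the original category
    return pd.Series(indices).map(dummies.__getitem__)

one_hot = np.array([[1, 0], [0, 1], [1, 0]])
print(reverse_map_sketch(one_hot, ['cat', 'dog']).tolist())  # ['cat', 'dog', 'cat']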

sarahmish commented Jul 13, 2021

suggestions to improve fit

Fitting time can be improved; the objective of fit is to (1) record the list of dummies and (2) check whether NaN is among them. Proposed change:

def fit(self, data):
    data = self._prepare_data(data)
    data[pd.isnull(data)] = np.nan
    self.dummy_na = pd.isnull(data).any()
    self.dummies = list(pd.unique(data))

The issue with this implementation is that the assignment data[pd.isnull(data)] = np.nan increases the time significantly.

In this case, the time it takes is:

CPU times: user 2.84 ms, sys: 0 ns, total: 2.84 ms
Wall time: 2.56 ms

@sarahmish

A workaround to remove the assignment:

def fit(self, data):
    data = self._prepare_data(data)
    self.dummy_na = pd.isnull(data).any()
    self.dummies = list(pd.unique(data[~pd.isnull(data)]))
    if self.dummy_na:
        self.dummies.append(np.nan)

In this case, the time it takes is:

CPU times: user 1.62 ms, sys: 0 ns, total: 1.62 ms
Wall time: 2.8 ms

@sarahmish

suggestions to improve transform

Some ideas to see whether we can use numpy to make the transformation faster than pandas.get_dummies:

def transform(self, data):
    data = self._prepare_data(data)
    data[pd.isnull(data)] = np.nan
    elements = np.vectorize(self.dummies.index)(data)
    array = np.eye(len(self.dummies))[elements].astype(int)
    for i, row in enumerate(array):
        if np.all(row == 0) and self.error_on_unknown:
            raise ValueError(f'The value {data[i]} was not seen during the fit stage.')

    return array

The problem with this implementation is that (1) it is not stable, since we cannot use dummies.index to find the index of NaN because nan != nan, and (2) even without NaN in the input, it ends up being slower than what we already have.

CPU times: user 6.02 ms, sys: 0 ns, total: 6.02 ms
Wall time: 8.12 ms


csala commented Jul 14, 2021

Very nice report @sarahmish !

A few suggestions I would make are:

  1. It may be worth using timeit instead of time when we go down to comparing the performance of specific lines or approaches (a rough sketch follows below).
  2. A potential improvement could come from capturing certain things in variables or even instance attributes. For example, nulls = pd.isnull(data) would allow you to skip a pd.isnull(data) call in the fit implementation, and self.num_dummies = len(self.dummies) inside fit would allow you to skip calling len(self.dummies) at every transform call. These are probably minor optimizations, but they add up anyway.
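
A minimal sketch of the timeit comparison from point 1, assuming standalone wrappers around the two fit approaches discussed above (the wrapper names and the sample column are made up for illustration):

import timeit

import numpy as np
import pandas as pd

# hypothetical sample column; the actual audit used the project's data generators
data = pd.Series(np.random.choice(['a', 'b', 'c', None], size=100))

def fit_current(data):
    # current approach: discover the categories through pd.get_dummies
    return list(pd.get_dummies(data, dummy_na=True).columns)

def fit_new(data):
    # proposed approach: pd.unique on the non-null values, then append NaN if needed
    null = pd.isnull(data)
    dummies = list(pd.unique(data[~null]))
    if null.any():
        dummies.append(np.nan)
    return dummies

for fit in (fit_current, fit_new):
    seconds = timeit.timeit(lambda: fit(data), number=1000)
    print(f'{fit.__name__}: {seconds / 1000 * 1e6:.1f} µs per call')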


csala commented Jul 14, 2021

Here is another suggestion: What about using np.broadcast_to?

This basically takes a 1d array and replicates it the indicated number of times to make a 2d matrix with that array as rows.
Using this, you can create 2 matrices of the same shape (given rows, number of categories):

  • 1 matrix that has the categorical values as rows
  • another matrix that has the given data as columns

Afterwards you do an element-wise comparison and get True where the category matches.

def get_dummies(data, categories):
    num_rows = len(data)
    num_categories = len(categories)
    dummies = np.broadcast_to(categories, (num_rows, num_categories))
    data = np.broadcast_to(data, (num_categories, num_rows)).T
    return (data == dummies).astype(int)
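
A quick sanity check of this helper on a toy input (the values below are made up for illustration):

import numpy as np

data = np.array(['a', 'b', 'a', 'c'])
categories = ['a', 'b', 'c']
print(get_dummies(data, categories))
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]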

The result is a function that is much faster than pd.get_dummies:

[plot: timing comparison showing the numpy get_dummies beating pd.get_dummies]

WRT the nulls, you could do the same thing as in the fit: separate all the rows that are NaNs and just set 1s in an additional column for them.


sarahmish commented Jul 14, 2021

@csala based on your suggestion I sketched out a new implementation for fit and transform; it looks something like this:

def fit(self, data):
    data = self._prepare_data(data)
    null = pd.isnull(data)
    self.dummy_na = null.any()
    self.dummies = list(pd.unique(data[~null]))
    if self.dummy_na:
        self.dummies.append(np.nan)

    self.num_dummies = len(self.dummies)

def transform(self, data):
    data = self._prepare_data(data)
    num_rows = len(data)
    
    dummies = np.broadcast_to(self.dummies, (num_rows, self.num_dummies))
    coded = np.broadcast_to(data, (self.num_dummies, num_rows)).T
    array = (coded == dummies).astype(int)

    if self.dummy_na:
        null = pd.isnull(data)
        num_nulls = sum(null) 
        null_code = np.zeros(self.num_dummies)
        null_code[-1] = 1
        array[null] = np.broadcast_to(null_code, (num_nulls, self.num_dummies))        

    for i, row in enumerate(array):
        if np.all(row == 0) and self.error_on_unknown:
            raise ValueError(f'The value {data[i]} was not seen during the fit stage.')

    return array

I made handling NaN a completely separate block to avoid the extra cost when there are no nulls. I also assume that NaN will always be encoded as [0, 0, ..., 1], since we append np.nan at the end of dummies.

I tested the transformers using timeit on multiple data generators of size 100

| Generator | Current Transformer (fit) | Current Transformer (transform) | New Transformer (fit) | New Transformer (transform) |
| --- | --- | --- | --- | --- |
| Random Categories | 577 µs | 1.21 ms | 328 µs | 575 µs |
| Random Categories with np.nan | 538 µs | 1.18 ms | 331 µs | 604 µs |
| Random Categories with None | 663 µs | 1.36 ms | 323 µs | 786 µs |
| Random Categories with [np.nan, None] | 681 µs | 1.35 ms | 331 µs | 808 µs |
| Unique Categories | 428 µs | 1.06 ms | 323 µs | 559 µs |

Here I can see that there is a ~43% speedup in the fit method and a ~50% speedup in transform. If I try something larger, testing size 10000, we get the following:

| Generator | Current Transformer (fit) | Current Transformer (transform) | New Transformer (fit) | New Transformer (transform) |
| --- | --- | --- | --- | --- |
| Random Categories | 1.33 ms | 50.9 ms | 1.05 ms | 51.5 ms |
| Random Categories with np.nan | 1.37 ms | 51.3 ms | 1.09 ms | 52.6 ms |
| Random Categories with None | 1.58 ms | 51.2 ms | 1.03 ms | 55.1 ms |
| Random Categories with [np.nan, None] | 1.63 ms | 51.8 ms | 1.03 ms | 56 ms |
| Unique Categories | 14.2 ms | 356 ms | 1.02 ms | 184 ms |

In this case, I definitely see that the fit method improved, but the transform of the new transformer only beats the old one on one of the generators (Unique Categories).

@sarahmish

I tested the get_dummies version above (using np.broadcast_to) on a larger input; in that case, pandas wins.

[plot: get_dummies vs. pd.get_dummies runtime on larger inputs]

As the data gets larger, the numpy implementation gets a lot slower and pandas becomes the better option.


sarahmish commented Jul 14, 2021

One thing I realized from the test above: when the input size is large, the search for all-zero rows is what takes a lot of time, which makes sense because we are iterating row by row.

can we go from

for i, row in enumerate(array):
    if np.all(row == 0) and self.error_on_unknown:
        raise ValueError(f'The value {data[i]} was not seen during the fit stage.')

to something like

unknown = array.sum(axis=1) == 0
if self.error_on_unknown and unknown.any():
    raise ValueError(f'Attempted to transform {list(data[unknown])} that were not seen during fit stage.')

This loop is the cause of the extremely long runtime seen in the test above for an input of size 10000.

@sarahmish

Stress testing the difference in transform between the pandas and numpy implementations, using the new error-raising snippet from the previous comment:

| Generator | pandas (size 100) | pandas (size 100000) | numpy (size 100) | numpy (size 100000) |
| --- | --- | --- | --- | --- |
| Random Categories | 1.24 ms | 11.8 ms | 114 µs | 23.1 ms |
| Random Categories with np.nan | 1.21 ms | 13.1 ms | 109 µs | 31.6 ms |
| Random Categories with None | 1.35 ms | 13.8 ms | 320 µs | 49.3 ms |
| Random Categories with [np.nan, None] | 1.35 ms | 14.4 ms | 323 µs | 50.2 ms |
| Unique Categories | 1.08 ms | - | 89.6 µs | - |

Notes

  • numpy is faster on smaller input
  • pandas is faster on large input
  • both approaches fail when the number of categories is large

Recommendation

  1. change fit to the new suggestion
  2. keep transform using pandas and change the error-raising part


csala commented Jul 15, 2021

Excellent report @sarahmish !

With regard to the recommendations, here's one thing to consider: the recursive nature of SDV relational models (HMA1 in particular) may provoke a lot of calls to transform and reverse_transform with just a few rows, rather than a single call with the entire table.

So I would suggest considering these two options:

  • Having both implementations in separate private methods and deciding which one to use based on the data size (a rough sketch of this option follows below).
  • Looping over batches of rows (batch size could be an optional parameter) instead of doing all the rows at once.
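
A rough, self-contained sketch of the first option; the threshold, the function names, and the reindex step in the pandas path are illustrative assumptions, not the library's actual code:

import numpy as np
import pandas as pd

# hypothetical crossover point; the size experiments later in this thread suggest ~3000 rows
SIZE_THRESHOLD = 3000

def transform_numpy(data, dummies):
    # broadcast-based comparison: faster on small inputs
    num_rows, num_dummies = len(data), len(dummies)
    coded = np.broadcast_to(data, (num_dummies, num_rows)).T
    return (coded == np.broadcast_to(dummies, (num_rows, num_dummies))).astype(int)

def transform_pandas(data, dummies):
    # pd.get_dummies path: faster on large inputs; reindex keeps one column per fitted category
    return pd.get_dummies(pd.Series(data)).reindex(columns=dummies, fill_value=0).astype(int).to_numpy()

def transform_dispatch(data, dummies):
    if len(data) < SIZE_THRESHOLD:
        return transform_numpy(data, dummies)
    return transform_pandas(data, dummies)

The batched option would look similar, for example looping over np.array_split(data, num_batches) and stacking the partial results.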


csala commented Jul 15, 2021

  • both approaches fail when the number of categories is large

I'm curious about this. Why do they fail? Memory Error?

@sarahmish

  • both approaches fail when the number of categories is large

I'm curious about this. Why do they fail? Memory Error?

Yeah, colab runs out of memory and the kernel dies.
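
For a rough sense of scale: the Unique Categories generator at size 100,000 implies 100,000 fitted categories, so the one-hot output is a 100,000 × 100,000 matrix, roughly 10^10 cells, which is about 10 GB even with a 1-byte dtype and around 80 GB with int64.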

@sarahmish

  • having both implementations in separate private methods and decide which one to use based on the data size.

Based on this suggestion, I experimented with different input sizes on the Random Categories with [np.nan, None] generator to figure out when we should use each one.

[plot: transform runtime of the pandas and numpy implementations across input sizes]

Starting at size=2800 the two approaches contend head to head; after size=3300 pandas becomes the clear winner.


sarahmish commented Jul 22, 2021

More analysis was done and based on it we have the following recommendation.

Some rules to follow:

  1. when the type of the data is numeric, array == array is a fast comparison.
  2. otherwise, map each element to a position and use that position as an encoding for broadcasting (a rough sketch follows below).
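
A minimal sketch of rule 2, assuming no nulls for brevity (the helper name is hypothetical):

import numpy as np
import pandas as pd

def one_hot_via_positions(data, dummies):
    # map every fitted category to its position in the list
    positions = {category: index for index, category in enumerate(dummies)}
    # unknown values become -1 so they never match any column
    coded = pd.Series(data).map(positions).fillna(-1).to_numpy()
    # numeric comparison broadcast to shape (num_rows, num_dummies)
    return (coded[:, None] == np.arange(len(dummies))).astype(int)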

The changes significantly improve the speed compared to the current implementation:

fit

def fit(self, data):

  • current: 975 µs per loop
  • improvement: 480 µs per loop

transform

def transform(self, data):

  • current: 6.87 ms per loop
  • improvement: 1.43 ms per loop

@fealho added this to the 0.5.1 milestone Aug 11, 2021