
Easy solution for restoring original dtypes #26

Closed
aldolamberti opened this issue Feb 4, 2020 · 10 comments

@aldolamberti

  • CTGAN version: 2.0.1
  • Python version: 3.7
  • Operating System: MacOS

Description

After sampling a dataset, we (@oregonpillow and I) noticed that all numerical columns are converted to floats. However, we can simply restore the original dtypes after sampling.

What I Did

data_dtype = original_df.dtypes.values
for i in range(len(sampled_df.columns)):
    sampled_df[sampled_df.columns[i]] = sampled_df[sampled_df.columns[i]].astype(data_dtype[i])

Question

Is this something we could consider implementing?

@csala csala mentioned this issue Mar 4, 2020

csala commented Mar 4, 2020

Thanks for the suggestion @aldolamberti !

Yes, as I just mentioned in @oregonpillow's PR, we'd be happy to accept a contribution about this!

Just one comment: I think the dtype conversion could be done more efficiently using the pandas.DataFrame.astype method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

The solution would then be as simple as storing the original dtypes as a self attribute inside fit:

self.dtypes = train_data.dtypes

And then restoring them back in the last line of sample:

sampled = self.transformer.inverse_transform(data, None)
return sampled.astype(self.dtypes)
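
For illustration, a minimal self-contained example of this pattern (toy column names and values, not the actual synthesizer code):

import pandas as pd

# Toy stand-ins for the training data and a sampled result.
original_df = pd.DataFrame({"age": [25, 38, 47], "children": [0, 2, 1]})
sampled_df = pd.DataFrame({"age": [31.0, 42.0, 26.0], "children": [1.0, 0.0, 3.0]})

dtypes = original_df.dtypes            # what fit would store as self.dtypes
restored = sampled_df.astype(dtypes)   # what sample would return at the end
print(restored.dtypes)                 # both columns are int64 again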

@oregonpillow

@csala thanks for the tip. However, when I tried self.dtypes = train_data.dtypes I got the error:
'numpy.ndarray' object has no attribute 'dtypes'

presumably because it's a numpy array and not a pandas DataFrame, right?
Looking at the numpy.ndarray.dtype documentation, it looks like numpy uses .dtype instead of pandas' .dtypes.

What I did: self.dtypes = train_data.dtype inside fit, then at the bottom of the sample function I had:

sampled = self.transformer.inverse_transform(data, None)
return sampled.astype(self.dtypes)

but now I get this error when I try to sample after fitting:
ValueError: could not convert string to float: ' State-gov'


csala commented Mar 5, 2020

@oregonpillow I think that is because you are extracting the dtypes too late inside the fit method. The line I suggested (self.dtypes = train_data.dtypes) should be the first thing that happens inside fit, before going through the transformer (which is precisely what converts the data to numpy and adds new columns to encode the categoricals).

Then, you should be able to get dtypes instead of dtype (because train_data is a DataFrame), and the conversion inside sample will also work.
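
A small illustration of why the ordering matters (toy data, not CTGAN internals): once the data has gone through the transformer it is a plain numpy array, which only exposes a single dtype:

import pandas as pd

train_data = pd.DataFrame({"age": [25, 38], "workclass": ["State-gov", "Private"]})
print(train_data.dtypes)          # one dtype per column: age int64, workclass object

transformed = train_data.values   # roughly what the transformer produces: a numpy array
print(transformed.dtype)          # a single dtype for the whole array (object here)
# transformed.dtypes              # AttributeError: 'numpy.ndarray' object has no attribute 'dtypes'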

@oregonpillow

Thanks @csala

I tested the code manually in Colab first and it works now!

However, when I implement this in my local fork and run the tests, I get the following error during make test:

E       AttributeError: 'numpy.ndarray' object has no attribute 'dtypes'

ctgan/synthesizer.py:116: AttributeError

---------- coverage: platform darwin, python 3.7.3-final-0 -----------
Name                   Stmts   Miss  Cover
------------------------------------------
ctgan/__init__.py          7      0   100%
ctgan/__main__.py         26     26     0%
ctgan/conditional.py      79      4    95%
ctgan/data.py             52     52     0%
ctgan/demo.py              4      1    75%
ctgan/models.py           50      0   100%
ctgan/sampler.py          34      3    91%
ctgan/synthesizer.py     141     10    93%
ctgan/transformer.py     110      4    96%
------------------------------------------
TOTAL                    503    100    80%

=================================================== 1 failed, 2 passed in 44.32s ===================================================
make: *** [test] Error 1

which makes no sense to me, since it runs fine in Colab. Any suggestions?


csala commented Mar 6, 2020

Sorry for the confusion @oregonpillow! You are totally right: I overlooked the fact that the DataFrame vs numpy issue is taken care of by the transformer, not the synthesizer. So the dtypes conversion should also be done there, inside transformer.py, not inside synthesizer.py.

The way to go would be to capture the dtypes inside the fit method, after converting the data to a DataFrame and before looping over the columns (transformer.py line 69):

self.dtypes = data.dtypes

And then restore the dtypes in the current line 172, right after the column_stack:

output = np.column_stack(output).astype(self.dtypes)

Would you mind trying it this way?


oregonpillow commented Mar 7, 2020

@csala I reverted the synthesizer back to its default 0.2.1 code and only changed transformer.py as you suggested. It passes make test, but it doesn't restore the sampled dtypes...?

Can you explain why the dtypes change in the synthesizer worked great when I tried it in Colab yesterday, yet the same code in my local environment fails make test with the AttributeError: 'numpy.ndarray' object has no attribute 'dtypes' error?


csala commented Mar 7, 2020

You're right again, @oregonpillow. I had not tried it myself before, and it turns out that the dtype assignment only works when working with a DataFrame, as numpy has a single dtype for the whole array.

And I just spotted another related problem: if the data is passed as a numpy array from the beginning, the dtypes are not properly captured from it.

I would do the following changes:

  1. When capturing the dtypes, add an infer_objects call before accessing the attribute.
    This will make pandas search for the best dtype for each column, fixing the problem when we have a numpy array as input.
        self.dtypes = data.infer_objects().dtypes
  2. When inverting the transform, invert the scheme: instead of building a DataFrame only if dataframe is true, always create a DataFrame, restore the dtypes, and then only go back to numpy if dataframe is false:
        output = np.column_stack(output)
        output = pd.DataFrame(output, columns=column_names).astype(self.dtypes)
        if not self.dataframe:
            output = output.values

        return output

This time I did my homework and tried it myself, so I'm quite sure this works now ;-) Can you give it a try?
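
For reference, a toy, self-contained sketch of the two changes above, using made-up column names and values rather than the actual transformer code:

import numpy as np
import pandas as pd

data = np.array([[25, "State-gov"], [38, "Private"]], dtype=object)  # numpy input case
column_names = ["age", "workclass"]
dataframe = isinstance(data, pd.DataFrame)                           # False here

# Change 1: infer_objects recovers proper per-column dtypes even from a numpy input.
dtypes = pd.DataFrame(data, columns=column_names).infer_objects().dtypes
print(dtypes)  # age: int64, workclass: object

# Pretend this is the per-column output of the inverse transform.
output = [np.array([29.0, 41.0]), np.array(["Private", "State-gov"], dtype=object)]

# Change 2: always build a DataFrame, restore the dtypes, then only go back to numpy.
result = np.column_stack(output)
result = pd.DataFrame(result, columns=column_names).astype(dtypes)
if not dataframe:
    result = result.values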

@oregonpillow

That works! :) Good idea @csala! I was not aware of the infer_objects function in pandas.

To clarify my understanding of your 2nd point:

If we built a DataFrame only when dataframe is true, then we would not be able to restore the dtypes for numpy arrays.

Therefore your solution is: by always creating a DataFrame, we can always restore the dtypes, and if the input was a numpy array (dataframe is false), we just return a numpy array (created from the DataFrame's values), since the DataFrame has already had the dtypes restored.

Is my understanding correct?

csala added a commit that referenced this issue Mar 10, 2020

csala commented Mar 10, 2020

Therefore your solution is: by always creating a DataFrame, we can always restore the dtypes, and if the input was a numpy array (dataframe is false), we just return a numpy array (created from the DataFrame's values), since the DataFrame has already had the dtypes restored.

Is my understanding correct?

That's correct! There is also one additional thing to consider, which is that the numpy array will in most cases ignore the dtypes and simply take the broadest one (i.e. object).

So, in other words: if a numpy array is given as input, even if the individual columns hold different dtypes, the array dtype will be object. Our infer_objects will catch that and fix the dtypes, which will be restored at the end on the DataFrame, but then they will be lost again when we return the .values.
However, this is all we can do, as a numpy array will never allow us to return individual dtypes per column. Also, the current approach gets as close as possible to the actual dtypes, which means that an infer_objects outside of CTGAN should be able to recover the actual types.
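
A quick demonstration of that last point (toy values and made-up column names):

import numpy as np
import pandas as pd

# A mixed-type numpy array can only hold a single dtype, so it falls back to object.
values = np.array([[25, "State-gov"], [38, "Private"]], dtype=object)
print(values.dtype)      # object

# Outside of CTGAN, the per-column dtypes can still be recovered:
recovered = pd.DataFrame(values, columns=["age", "workclass"]).infer_objects()
print(recovered.dtypes)  # age: int64, workclass: object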


csala commented Mar 10, 2020

Closed via #33. Thanks again @oregonpillow!

@csala csala closed this as completed Mar 10, 2020
@csala csala added the internal The issue doesn't change the API or functionality label Mar 10, 2020
@csala csala added this to the 0.2.2 milestone Mar 10, 2020