
Easy solution for restoring original dtypes #26

Closed
aldolamberti opened this issue Feb 4, 2020 · 10 comments

@aldolamberti

  • CTGAN version: 2.0.1
  • Python version: 3.7
  • Operating System: MacOS

Description

After sampling a dataset, we (@oregonpillow and I) noticed that all numerical columns are converted to floats. However, we can simply restore the original dtypes after sampling.

What I Did

data_dtype = original_df.dtypes.values
for i in range(len(sampled_df.columns)):
    sampled_df[sampled_df.columns[i]] = sampled_df[sampled_df.columns[i]].astype(data_dtype[i])

Question

Is this something we could consider implementing?

@csala csala mentioned this issue Mar 4, 2020

csala commented Mar 4, 2020

Thanks for the suggestion @aldolamberti !

Yes, as I just mentioned in @oregonpillow's PR, we'd be happy to accept a contribution about this!

Just one comment: I think the dtype conversion could be done more efficiently using the pandas.DataFrame.astype method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

The solution would then be as simple as storing the original dtypes as a self attribute inside fit:

self.dtypes = train_data.dtypes

And then restoring them back in the last line of sample:

sampled = self.transformer.inverse_transform(data, None)
return sampled.astype(self.dtypes)
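
For illustration, a minimal self-contained example of this pattern (toy column names and values, not the actual synthesizer code):

import pandas as pd

# Toy stand-ins for the training data and a sampled result.
original_df = pd.DataFrame({"age": [25, 38, 47], "children": [0, 2, 1]})
sampled_df = pd.DataFrame({"age": [31.0, 42.0, 26.0], "children": [1.0, 0.0, 3.0]})

dtypes = original_df.dtypes            # what fit would store as self.dtypes
restored = sampled_df.astype(dtypes)   # what sample would return at the end
print(restored.dtypes)                 # both columns are int64 again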

@oregonpillow

@csala thanks for the tip. However, when I tried self.dtypes = train_data.dtypes I got the error:
'numpy.ndarray' object has no attribute 'dtypes'

presumably because it's a numpy array and not a pandas DataFrame, right?
Looking at the numpy.ndarray.dtype documentation, it looks like numpy uses .dtype instead of pandas' .dtypes.

What I did: self.dtypes = train_data.dtype inside fit, then at the bottom of the sample function I had:

sampled = self.transformer.inverse_transform(data, None)
return sampled.astype(self.dtypes)

but now I get this error when I try to sample after fitting:
ValueError: could not convert string to float: ' State-gov'


csala commented Mar 5, 2020

@oregonpillow I think that is because you are extracting the dtypes too late inside the fit method. The line I suggested (self.dtypes = train_data.dtypes) should be the first thing that happens inside fit, before going through the transformer (which is precisely what converts the data to numpy and adds new columns to encode the categoricals).

Then, you should be able to get dtypes instead of dtype (because train_data is a DataFrame), and the conversion inside sample will also work.
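
A small illustration of why the ordering matters (toy data, not CTGAN internals): once the data has gone through the transformer it is a plain numpy array, which only exposes a single dtype:

import pandas as pd

train_data = pd.DataFrame({"age": [25, 38], "workclass": ["State-gov", "Private"]})
print(train_data.dtypes)          # one dtype per column: age int64, workclass object

transformed = train_data.values   # roughly what the transformer produces: a numpy array
print(transformed.dtype)          # a single dtype for the whole array (object here)
# transformed.dtypes              # AttributeError: 'numpy.ndarray' object has no attribute 'dtypes'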

@oregonpillow

Thanks @csala

I tested the code manually in Colab first and it works now!

However, when I implement this in my local fork and run the tests, I get the following error during make test:

E       AttributeError: 'numpy.ndarray' object has no attribute 'dtypes'

ctgan/synthesizer.py:116: AttributeError

---------- coverage: platform darwin, python 3.7.3-final-0 -----------
Name                   Stmts   Miss  Cover
------------------------------------------
ctgan/__init__.py          7      0   100%
ctgan/__main__.py         26     26     0%
ctgan/conditional.py      79      4    95%
ctgan/data.py             52     52     0%
ctgan/demo.py              4      1    75%
ctgan/models.py           50      0   100%
ctgan/sampler.py          34      3    91%
ctgan/synthesizer.py     141     10    93%
ctgan/transformer.py     110      4    96%
------------------------------------------
TOTAL                    503    100    80%

=================================================== 1 failed, 2 passed in 44.32s ===================================================
make: *** [test] Error 1

which makes no sense to me, since it runs fine in Colab. Any suggestions?


csala commented Mar 6, 2020

Sorry for the confusion @oregonpillow! You are totally right: I overlooked the fact that the DataFrame vs numpy issue is taken care of by the transformer, not the synthesizer. So the dtypes conversion should also be done there, inside transformer.py, not inside synthesizer.py.

The way to go would be to capture the dtypes inside the fit method, after converting the data to a DataFrame and before looping over the columns (transformer.py line 69):

self.dtypes = data.dtypes

And then restore the dtypes in the current line 172, right after the column_stack:

output = np.column_stack(output).astype(self.dtypes)

Would you mind trying it this way?


oregonpillow commented Mar 7, 2020

@csala I reverted the synthesizer back to its default 0.2.1 code and only changed transformer.py as you suggested. It passes make test, but it doesn't restore the sampled dtypes...?

Can you explain why the dtypes change in the synthesizer worked great when I tried it in Colab yesterday, yet the same code in my local environment fails make test with the AttributeError: 'numpy.ndarray' object has no attribute 'dtypes' error?


csala commented Mar 7, 2020

You're right again, @oregonpillow. I had not tried it myself before, and it turns out that the dtype assignment only works when working with a DataFrame, as numpy has a single dtype for the whole array.

And I just spotted another related problem: if the data is passed as a numpy array from the beginning, the dtypes are not properly captured from it.

I would do the following changes:

  1. When capturing the dtypes, add an infer_objects call before accessing the attribute.
    This will make pandas search for the best dtype for each column, fixing the problem when we have a numpy array as input.
        self.dtypes = data.infer_objects().dtypes
  2. When inverting the transform, invert the scheme: instead of building a DataFrame only if dataframe is true, always create a DataFrame, restore the dtypes, and then only go back to numpy if dataframe is false:
        output = np.column_stack(output)
        output = pd.DataFrame(output, columns=column_names).astype(self.dtypes)
        if not self.dataframe:
            output = output.values

        return output

This time I did my homework and tried it myself, so I'm quite sure this works now ;-) Can you give it a try?
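
For reference, a toy, self-contained sketch of the two changes above, using made-up column names and values rather than the actual transformer code:

import numpy as np
import pandas as pd

data = np.array([[25, "State-gov"], [38, "Private"]], dtype=object)  # numpy input case
column_names = ["age", "workclass"]
dataframe = isinstance(data, pd.DataFrame)                           # False here

# Change 1: infer_objects recovers proper per-column dtypes even from a numpy input.
dtypes = pd.DataFrame(data, columns=column_names).infer_objects().dtypes
print(dtypes)  # age: int64, workclass: object

# Pretend this is the per-column output of the inverse transform.
output = [np.array([29.0, 41.0]), np.array(["Private", "State-gov"], dtype=object)]

# Change 2: always build a DataFrame, restore the dtypes, then only go back to numpy.
result = np.column_stack(output)
result = pd.DataFrame(result, columns=column_names).astype(dtypes)
if not dataframe:
    result = result.values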

@oregonpillow

That works! :) Good idea @csala! I was not aware of the infer_objects function in pandas.

To clarify my understanding of your 2nd point:

If we built a DataFrame only when dataframe is true, then we would not be able to restore the dtypes for numpy arrays.

Therefore your solution is: by always creating a DataFrame, we can always restore the dtypes, and if the input was a numpy array (dataframe is false), we just return a numpy array (created from the DataFrame's values), since the DataFrame has already had the dtypes restored.

Is my understanding correct?

csala added a commit that referenced this issue Mar 10, 2020

csala commented Mar 10, 2020

Therefore your solution is: by always creating a DataFrame, we can always restore the dtypes, and if the input was a numpy array (dataframe is false), we just return a numpy array (created from the DataFrame's values), since the DataFrame has already had the dtypes restored.

Is my understanding correct?

That's correct! There is also one additional thing to consider, which is that the numpy array will in most cases ignore the dtypes and simply take the broadest one (i.e. object).

So, in other words: if a numpy array is given as input, even if the individual columns hold different dtypes, the array dtype will be object. Our infer_objects will catch that and fix the dtypes, which will be restored at the end on the DataFrame, but then they will be lost again when we return the .values.
However, this is all we can do, as a numpy array will never allow us to return individual dtypes per column. Also, the current approach gets as close as possible to the actual dtypes, which means that an infer_objects outside of CTGAN should be able to recover the actual types.
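
A quick demonstration of that last point (toy values and made-up column names):

import numpy as np
import pandas as pd

# A mixed-type numpy array can only hold a single dtype, so it falls back to object.
values = np.array([[25, "State-gov"], [38, "Private"]], dtype=object)
print(values.dtype)      # object

# Outside of CTGAN, the per-column dtypes can still be recovered:
recovered = pd.DataFrame(values, columns=["age", "workclass"]).infer_objects()
print(recovered.dtypes)  # age: int64, workclass: object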


csala commented Mar 10, 2020

Closed via #33. Thanks again @oregonpillow!

@csala csala closed this as completed Mar 10, 2020
@csala csala added the internal The issue doesn't change the API or functionality label Mar 10, 2020
@csala csala added this to the 0.2.2 milestone Mar 10, 2020