Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding support for datetime numpy arrays #36

Closed
chekos opened this issue Oct 15, 2018 · 2 comments
Closed

adding support for datetime numpy arrays #36

chekos opened this issue Oct 15, 2018 · 2 comments

Comments

@chekos
Copy link

chekos commented Oct 15, 2018

While pd.apply() works for small datasets like the example from the docs

df['ADJUSTED'] = df.apply(lambda x: cpi.inflate(x.MEDIAN_HOUSEHOLD_INCOME, x.YEAR), axis=1)

it quickly falls apart if one tries to inflate long series because it inflates each value one at a time instead of taking advantage of numpy and pandas vectorization.

CPI already can handle numpy arrays and has both pandas and numpy as dependencies.
cpi_incomes
(100,000,000 rows in less than 2 seconds, pretty cool.)

The problem:

CPI takes year_or_month as either int or a date object and retrieves the corresponding source_index from cpi.db. This, as far as I understand, would need to be done for every item in the array therefore it would still be very time-consuming for very large datasets.

The solution:

I still don't have any solid solutions.

One way to approach this could be:

  1. receive a numpy array of dates for year_or_month

    • clean it so they all have 01 as day of month
      cpi_dates
  2. grab the unique values in this array of dates

    • even if you have 100,000,000 rows, you definitely don't have 100,000,000 different year-month combinations.
      • BLS' data goes back to 1913 (2017-1913=104 years, 104 * 12 = 1248 months + 10 months of 2018 as of now = 1258 unique values at most)
    • create a numpy array of those values matching their date (or a dict() to later use .map() on the dates array.)
  3. map the source_index values to the array of dates

    • look up the CPI value for each of those unique dates and map it back to the original numpy array of dates
  4. cpi.inflate() already just multiplies (value * target_index) / float(source_index)

    • numpy will take care of the rest

Even though most likely one would be inflating values to one specific year or month, this method could be applied to both year_or_month and to to inflate a series of values from a series of dates to a different series of dates.

The use:

The particular use I came up with was normalizing different types of incomes from public use microdata. For example, if I go to ipums and grab ACS data from 2000-2016 for incomes (earned wages, household income, farm income, social security, etc).
There are only 16 distinct years but if I use pd.apply() it would go row by row and it would simply never end:
acs2000-16


I don't have a experience with sqlite so I couldn't put together a proof of concept but I hope this explanation is helpful.

@chekos
Copy link
Author

chekos commented Oct 22, 2018

I think I found a way to implement this method I was talking about but I'm not sure where it'd go in the code. I assume it'd be just another elif in inflate() where if the year_or_month cpi.inflate() receives is a series this would go into effect. Here's a (simple) example notebook I put together: https://github.com/chekos/cpi/blob/master/example.ipynb

The elevator pitch is:
If you receive a series of dates, just grab the unique values, look up the index (source_index), save it as a dict, map the values back to the series. Now grab the to value, if it's a series, do the same. Now you have 3 series (of values, dates to inflate from, dates to inflate to). They are all pandas series or numpy arrays so you can just return (value * target_index) / source_index like you would regularly.

@palewire
Copy link
Owner

If you wanted to submit a pull request on this I'd appreciate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants