adding support for datetime numpy arrays #36

chekos · 2018-10-15T19:26:11Z

While pd.apply() works for small datasets like the example from the docs:

df['ADJUSTED'] = df.apply(lambda x: cpi.inflate(x.MEDIAN_HOUSEHOLD_INCOME, x.YEAR), axis=1)

it quickly falls apart if one tries to inflate long series because it inflates each value one at a time instead of taking advantage of numpy and pandas vectorization.

CPI already can handle numpy arrays and has both pandas and numpy as dependencies.

(100,000,000 rows in less than 2 seconds, pretty cool.)

The problem:

CPI takes year_or_month as either an int or a date object and retrieves the corresponding source_index from cpi.db. As far as I understand, this lookup would need to be done for every item in the array, so it would still be very time-consuming for very large datasets.

The solution:

I still don't have any solid solutions.

One way to approach this could be:

1. Receive a numpy array of dates for year_or_month.
   - Clean it so they all have 01 as the day of the month.
2. Grab the unique values in this array of dates.
   - Even if you have 100,000,000 rows, you definitely don't have 100,000,000 different year-month combinations.
     - BLS' data goes back to 1913 (1913 through 2017 is 105 years, 105 * 12 = 1260 months + 10 months of 2018 as of now = 1270 unique values at most).
   - Create a numpy array of the index values matching each date (or a dict() to later use .map() on the dates array).
3. Map the source_index values to the array of dates.
   - Look up the CPI value for each of those unique dates and map it back to the original numpy array of dates.
4. cpi.inflate() already just computes (value * target_index) / float(source_index).
   - numpy will take care of the rest.
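
The steps above could be sketched roughly as follows. This is only a minimal sketch, not how cpi is implemented: `FAKE_INDEX`, `lookup_index()` and `inflate_array()` are hypothetical names, and the index values are made-up placeholders standing in for whatever the real lookup against cpi.db returns.

```python
import numpy as np

# Placeholder index values keyed by "YYYY-MM". These are NOT real CPI figures.
FAKE_INDEX = {"2000-01": 100.0, "2000-02": 110.0, "2018-01": 200.0}


def lookup_index(month):
    """Hypothetical stand-in for the per-date lookup against cpi.db."""
    return FAKE_INDEX[str(month)]


def inflate_array(values, dates, to):
    """Inflate an array of values, doing one index lookup per unique month."""
    values = np.asarray(values, dtype=float)

    # Step 1: truncate every date to its month (equivalent to setting day to 01).
    months = np.asarray(dates, dtype="datetime64[D]").astype("datetime64[M]")

    # Step 2: unique months, plus each row's position in that unique array.
    unique_months, inverse = np.unique(months, return_inverse=True)

    # Step 3: look up each unique month once, then map back to every row.
    unique_index = np.array([lookup_index(m) for m in unique_months])
    source_index = unique_index[inverse]

    # Step 4: the same arithmetic as the scalar case, applied to whole arrays.
    target_month = np.datetime64(to, "D").astype("datetime64[M]")
    target_index = lookup_index(target_month)
    return values * target_index / source_index


adjusted = inflate_array(
    [10.0, 20.0, 30.0],
    ["2000-01-05", "2000-02-20", "2000-01-31"],
    to="2018-01-01",
)
```

Here three rows need only two source lookups, because two of the dates fall in the same month. The number of lookups is bounded by the number of distinct months, not the number of rows.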

Even though one would most likely be inflating values to one specific year or month, this method could be applied to both year_or_month and to, so it could inflate a series of values from a series of dates to a different series of dates.

The use:

The particular use I came up with was normalizing different types of income from public use microdata. For example, say I go to IPUMS and grab ACS data from 2000-2016 for incomes (earned wages, household income, farm income, social security, etc.).
There are only 17 distinct years, but if I use pd.apply() it would go row by row and would simply never end.
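
For a use case like this, the idea would be to look up one index per distinct year and map it onto the column. A rough sketch, assuming a hypothetical `lookup_annual_index()` in place of the real cpi lookup and placeholder index values (not real CPI figures); the column names are only illustrative:

```python
import pandas as pd

# Placeholder annual index values. These are NOT real CPI figures.
FAKE_ANNUAL_INDEX = {2000: 100.0, 2008: 125.0, 2016: 140.0}


def lookup_annual_index(year):
    """Hypothetical stand-in for the per-year lookup against cpi.db."""
    return FAKE_ANNUAL_INDEX[int(year)]


df = pd.DataFrame(
    {
        "YEAR": [2000, 2008, 2016, 2000, 2008],
        "INCOME": [1000.0, 2500.0, 2800.0, 2000.0, 5000.0],
    }
)

# One lookup per distinct year, no matter how many rows the frame has.
index_by_year = {int(y): lookup_annual_index(y) for y in df["YEAR"].unique()}

# Map the per-year index onto every row, then do the arithmetic column-wise.
source_index = df["YEAR"].map(index_by_year)
target_index = lookup_annual_index(2016)
df["ADJUSTED"] = df["INCOME"] * target_index / source_index
```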

I don't have any experience with SQLite, so I couldn't put together a proof of concept, but I hope this explanation is helpful.


chekos · 2018-10-22T18:58:36Z

I think I found a way to implement the method I was talking about, but I'm not sure where it'd go in the code. I assume it'd be just another elif in inflate() that takes effect when the year_or_month that cpi.inflate() receives is a series. Here's a (simple) example notebook I put together: https://github.com/chekos/cpi/blob/master/example.ipynb

The elevator pitch is:
If you receive a series of dates, just grab the unique values, look up the index (source_index) for each, save the results as a dict, and map them back onto the series. Then take the to value; if it's also a series, do the same. Now you have 3 series (the values, the index for the dates to inflate from, and the index for the dates to inflate to). They are all pandas series or numpy arrays, so you can just return (value * target_index) / source_index like you would regularly.
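
As a rough illustration of where such a branch might sit, here is a sketch. It assumes nothing about the real body of inflate(): `FAKE_INDEX`, `lookup_index()` and `resolve_index()` are hypothetical names, and the index values are placeholders rather than real CPI figures.

```python
import datetime

import pandas as pd

# Placeholder index values. These are NOT real CPI figures.
FAKE_INDEX = {
    datetime.date(2000, 1, 1): 100.0,
    datetime.date(2010, 1, 1): 150.0,
    datetime.date(2018, 1, 1): 200.0,
}


def lookup_index(date):
    """Hypothetical stand-in for the single-date lookup against cpi.db."""
    return FAKE_INDEX[date]


def resolve_index(year_or_month):
    """Return one index for a scalar date, or a series of them for a series."""
    if isinstance(year_or_month, pd.Series):
        # Series branch: look up each unique date once, then map back.
        lookup = {d: lookup_index(d) for d in year_or_month.unique()}
        return year_or_month.map(lookup)
    # Scalar branch: a single lookup, as before.
    return lookup_index(year_or_month)


def inflate(value, year_or_month, to):
    source_index = resolve_index(year_or_month)
    target_index = resolve_index(to)
    # Same formula either way; pandas handles the element-wise arithmetic.
    return (value * target_index) / source_index


values = pd.Series([10.0, 20.0])
dates = pd.Series([datetime.date(2000, 1, 1), datetime.date(2010, 1, 1)])
result = inflate(values, dates, to=datetime.date(2018, 1, 1))
```

Because `to` goes through the same helper, passing a series of target dates would work the same way, as long as it is aligned with the values.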

palewire · 2023-03-14T14:08:37Z

If you wanted to submit a pull request on this I'd appreciate it.

palewire closed this as completed Mar 14, 2023
