While `pd.apply()` works for small datasets like the example from the docs, it quickly falls apart if one tries to inflate long series, because it inflates each value one at a time instead of taking advantage of numpy and pandas vectorization.
CPI can already handle numpy arrays and has both pandas and numpy as dependencies. (100,000,000 rows in less than 2 seconds, pretty cool.)
The problem:
CPI takes `year_or_month` as either an `int` or a `date` object and retrieves the corresponding `source_index` from `cpi.db`. As far as I understand, this lookup would need to be done for every item in the array, so it would still be very time-consuming for very large datasets.
The solution:
I still don't have any solid solutions.
One way to approach this could be:
- receive a numpy array of dates for `year_or_month`
- clean it so they all have `01` as the day of the month
- grab the unique values in this array of dates
  - even if you have 100,000,000 rows, you definitely don't have 100,000,000 different year-month combinations
  - BLS' data goes back to 1913 (2017 - 1913 = 104 years, 104 * 12 = 1248 months, plus 10 months of 2018 as of now = 1258 unique values at most)
- create a numpy array of those values matching their date (or a `dict()` to later use `.map()` on the dates array)
- map the `source_index` values to the array of dates
- look up the CPI value for each of those unique dates and map it back to the original numpy array of dates
- `cpi.inflate()` already just multiplies `(value * target_index) / float(source_index)`
- numpy will take care of the rest
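The steps above can be sketched roughly like this. The `INDEX_BY_MONTH` values and the `inflate_series` helper are made up for illustration; in the real library the index values would come from `cpi.db` (one query per unique month):

```python
import pandas as pd

# Placeholder monthly index values, for illustration only -- the real
# numbers would be fetched from cpi.db, once per unique month.
INDEX_BY_MONTH = {
    pd.Timestamp("2000-01-01"): 168.8,
    pd.Timestamp("2010-06-01"): 217.965,
}

def inflate_series(values, dates, target_index):
    """Vectorized inflation: one index lookup per unique month, not per row."""
    dates = pd.Series(pd.to_datetime(dates))
    # clean the dates so they all fall on the first of the month
    months = dates.dt.to_period("M").dt.to_timestamp()
    # look up the index once per unique month, saved as a dict...
    lookup = {pd.Timestamp(m): INDEX_BY_MONTH[pd.Timestamp(m)]
              for m in months.unique()}
    # ...then map it back onto every row
    source_index = months.map(lookup)
    # same arithmetic cpi.inflate() already does, now fully vectorized
    return (pd.Series(values) * target_index) / source_index
```

However many rows the input has, the dictionary never grows past the number of distinct year-month combinations, so the expensive lookup happens at most ~1258 times.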
Even though one would most likely be inflating values to one specific year or month, this method could be applied to both `year_or_month` and `to`, inflating a series of values from a series of dates to a different series of dates.
The use:
The particular use I came up with was normalizing different types of income from public use microdata. For example, say I go to IPUMS and grab ACS data from 2000-2016 for incomes (earned wages, household income, farm income, social security, etc.).
There are only 16 distinct years, but if I use `pd.apply()` it goes row by row and simply never ends.
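A toy version of that use case might look like the following; the annual index values are made up for illustration, and the `acs` frame is a hypothetical stand-in for an IPUMS extract:

```python
import pandas as pd

# Made-up annual index values, for illustration only (not real CPI data).
ANNUAL_INDEX = {2000: 172.2, 2008: 215.3, 2016: 240.0}

# A tiny stand-in for an ACS extract: one row per respondent,
# several income columns recorded in that year's dollars.
acs = pd.DataFrame({
    "year": [2000, 2008, 2016, 2000],
    "incwage": [30000, 45000, 52000, 28000],
    "hhincome": [55000, 80000, 91000, 50000],
})

# One lookup per distinct year, mapped back onto every row --
# no row-by-row apply() needed.
source_index = acs["year"].map(ANNUAL_INDEX)
target_index = ANNUAL_INDEX[2016]

# adjust every income column at once to 2016 dollars
for col in ["incwage", "hhincome"]:
    acs[col + "_2016"] = acs[col] * target_index / source_index
```

The cost scales with the number of distinct years in the extract, not the number of respondents.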
I don't have experience with sqlite, so I couldn't put together a proof of concept, but I hope this explanation is helpful.
I think I found a way to implement the method I was talking about, but I'm not sure where it'd go in the code. I assume it'd just be another `elif` in `inflate()` that takes effect when the `year_or_month` that `cpi.inflate()` receives is a series. Here's a (simple) example notebook I put together: https://github.com/chekos/cpi/blob/master/example.ipynb
The elevator pitch is:
If you receive a series of dates, just grab the unique values, look up the index (`source_index`), save it as a dict, and map the values back to the series. Then grab the `to` value; if it's a series, do the same. Now you have three series (values, dates to inflate from, dates to inflate to). They are all pandas series or numpy arrays, so you can just return `(value * target_index) / source_index` like you would regularly.
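That pitch can be sketched as follows, with illustrative index values standing in for the `cpi.db` lookups:

```python
import pandas as pd

# Illustrative index values, for demonstration only (not real CPI data).
INDEX = {
    pd.Timestamp("2000-01-01"): 168.8,
    pd.Timestamp("2010-01-01"): 216.7,
    pd.Timestamp("2018-01-01"): 247.9,
}

def lookup_index(dates):
    """One lookup per unique date, saved as a dict, mapped back to the series."""
    dates = pd.Series(pd.to_datetime(dates))
    table = {pd.Timestamp(d): INDEX[pd.Timestamp(d)] for d in dates.unique()}
    return dates.map(table)

def inflate(values, year_or_month, to):
    # both the "from" and the "to" dates may be series; each gets the
    # same unique-then-map treatment, then the math is fully vectorized
    source_index = lookup_index(year_or_month)
    target_index = lookup_index(to)
    return (pd.Series(values) * target_index.values) / source_index.values
```

For example, `inflate([100.0], ["2000-01-01"], ["2018-01-01"])` does a single dict lookup per side and one vectorized multiply/divide, no matter how many rows share those dates.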