Use Dictionary to Parse CPS DAT files #345
Conversation
⚡ Thanks for making this code clearer and simpler @andersonfrailey!
I gave this a whirl with 2018 data, and it worked great. Might make sense as a separate PR, but I'm wondering if this might be an opportunity to also prepare for CSVs in future ASECs, assuming the pattern in 2019 continues (#346). I was looking at using the 2019 CSVs, but had trouble because some of the processing occurs in the data loading step. Would it be possible to break these steps out? Something like this could be cleaner and more flexible:
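(Very rough sketch; the function names here are made up, not anything currently in the repo:)

```python
import pandas as pd

def load_asec(path: str) -> pd.DataFrame:
    """Step 1: just read the raw ASEC, whether fixed-width DAT or CSV."""
    if path.endswith(".csv"):
        return pd.read_csv(path)
    ...  # existing fixed-width DAT parsing would go here

def process_asec(raw: pd.DataFrame) -> pd.DataFrame:
    """Step 2: the recoding/cleanup that currently happens inside loading."""
    ...
```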
This all goes beyond this PR, but given my challenges running poverty analyses with taxcalc, I'd love to see the connection to the raw ASEC become simpler, especially as the 2020 ASEC will be released in a couple of months.
@MaxGhenis, it definitely is possible. I actually did initially convert the DAT files into DataFrames and then do almost everything with pandas operations up until we got to the tax unit formation. The reason I switched was speed. I found that when I started the CPS off as a DataFrame, grouped it by household, and then iterated through each household, it took around 20 minutes per file to create the tax units. When I simply kept everything as a list of lists of dictionaries, I got the runtime down to the 30 seconds or so we have now. The biggest choke point came after I grouped all of the households: I'd convert each one to a list of dictionaries so that it'd be easier for me to wrap my head around how we were forming the tax units within each household. I'm sure there is a way to avoid that and do all the searching/grouping we need with pandas operations, but that was well beyond what I knew about pandas when I started the conversion.

This is all just a long-winded way of saying: yes, we could structure the inputs as DataFrames, but it would cost us speed, or we'd need to refactor all of the tax unit creation functions to use pandas operations. Not having to wait on NBER to release SAS scripts for each CPS file would be great, though, so I'm open to discussing how we could do this in a separate issue.
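For reference, the non-pandas grouping step amounts to something like this (a rough sketch; using `h_seq` as the household sequence field is illustrative, not necessarily the exact variable we key on):

```python
from collections import defaultdict

def group_households(person_records):
    """Group parsed person-level dicts by household without pandas.

    Keeping everything as plain lists and dicts avoids the per-household
    DataFrame overhead that made the groupby version slow.
    """
    households = defaultdict(list)
    for rec in person_records:
        # h_seq: household sequence number (field name illustrative)
        households[rec["h_seq"]].append(rec)
    return list(households.values())
```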
@MaxGhenis, I'm going to go ahead and merge this PR since it resolves issue #342. But I think we should continue the conversation about accepting different file formats for the CPS in a separate issue. It'd be a great feature to add to the repo.

This PR addresses issue #342. Rather than having separate `cpsmar201X.py` files for each year of the CPS, we now have a single file, `cpsmar.py`, that uses a dictionary containing all of the information on where in the raw DAT file each variable is. This dictionary is held in `master_cps_dict.pkl`. Its general structure is:

```python
{
    cps_year: {
        "household": {"variable": (start_position, end_position, decimals)},
        "family": {"variable": (start_position, end_position, decimals)},
        "person": {"variable": (start_position, end_position, decimals)},
    }
}
```

- `start_position` is the character in the string where the variable starts.
- `end_position` is the character after the variable ends.
- `decimals` is how many decimal places are implied for floating point variables.

For example, in the SAS script that we parse to obtain all of this information, the line for `hsup_wgt` implies that the variable starts at character 287 (286 with zero-based numbering), takes up the next 8 characters, and has two implied decimal places. Most variables have no implied decimals and will be interpreted as integers. For those that do have implied decimals, we convert them to floats as we parse the file.

These changes have no effect on the final CPS file produced.
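To make the layout concrete, here is a minimal sketch of how a `(start, end, decimals)` entry can be applied to one line of a DAT file (illustrative only, not the actual `cpsmar.py` implementation):

```python
import pickle

# Load the position dictionary shipped with the repo.
with open("master_cps_dict.pkl", "rb") as f:
    master_dict = pickle.load(f)

def parse_record(line, layout):
    """Extract each variable from one fixed-width record.

    `layout` maps a variable name to (start, end, decimals),
    e.g. hsup_wgt -> (286, 294, 2) per the example above.
    """
    record = {}
    for var, (start, end, decimals) in layout.items():
        raw = line[start:end]
        if decimals:
            # Implied decimals: shift the decimal point left.
            record[var] = int(raw) / 10 ** decimals
        else:
            record[var] = int(raw)
    return record

# e.g. to parse person records from the 2018 file:
# person_layout = master_dict[2018]["person"]
```

Treating the positions as zero-based slice indices is what makes `end_position` "the character after the variable ends."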
cc @MaxGhenis