
Conversation

@andersonfrailey
Collaborator

This PR addresses issue #342. Rather than having separate cpsmar201X.py files for each year of the CPS, we now have a single file, cpsmar.py, that uses a dictionary containing all of the information on where each variable is located in the raw DAT file. This dictionary is stored in master_cps_dict.pkl. Its general structure is:

{
    cps_year: {
        "household": {"variable": (start_position, end_position, decimals)},
        "family": {"variable": (start_position, end_position, decimals)},
        "person": {"variable": (start_position, end_position, decimals)}
    }
}

start_position is the character in the string where the variable starts
end_position is the character immediately after the variable ends
decimals is the number of implied decimal places for floating-point variables. For example, in the SAS script that we parse to obtain all of this information, the line for hsup_wgt is

@287  hsup_wgt        8.2

This implies that the variable hsup_wgt starts at character 287 (286 with zero-based indexing), takes up the next 8 characters, and has two implied decimal places. Most variables have no implied decimals and are interpreted as integers. Those that do have implied decimals are converted to floats as we parse the file.
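For illustration, here is a minimal sketch of how one of these tuples could be applied to a raw record. The extract helper is hypothetical, not the actual cpsmar.py code, and it assumes the stored positions are one-based like the @287 in the SAS line:

def extract(line, start, end, decimals):
    # Hypothetical helper, not the actual cpsmar.py implementation.
    # Assumes (start, end) positions are one-based, matching SAS @-notation.
    raw = line[start - 1:end - 1]
    value = int(raw)
    # decimals=2 turns the 8-character field "00012345" into 123.45;
    # decimals=0 fields stay integers.
    return value / 10 ** decimals if decimals else value

record = " " * 286 + "00012345"       # fabricated record: hsup_wgt field only
print(extract(record, 287, 295, 2))   # -> 123.45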

These changes have no effect on the final CPS file produced.

cc @MaxGhenis

@MaxGhenis
Contributor


Thanks for making this code clearer and simpler, @andersonfrailey!

@MaxGhenis
Contributor

MaxGhenis commented Jul 21, 2020

I gave this a whirl with 2018 data, and it worked great.

Might make sense as a separate PR, but I'm wondering if this might be an opportunity to also prepare for CSVs in future ASECs, assuming the pattern in 2019 continues (#346). I was looking at using the 2019 CSVs, but had trouble because some of the processing occurs in the data loading step.

Would it be possible to break these steps out? Something like this could be cleaner and more pandas-oriented (a rough sketch of a few of these steps follows the list):

  1. Load raw data as DataFrames. The current process could probably be further simplified by using pd.read_fwf (I'd expect it to be faster, too). For 2019+, it would just be loading the person CSV (I'm not seeing a need for the family and household records, given the person records include all identifiers, but let me know if I'm missing something).
  2. Merge person with family and household records (if needed). Again, not sure this is necessary, but if it is, it could be done by first merging on person.ph_seq = hh[hh.h_hhtype == 1].h_seq and then person_hh.ph_seq = family.fh_seq and person_hh.phf_seq = family.ffpos, per this SAS code.
  3. Do person-level processing and C-TAM merging. e.g. replace person_details() in cpsmar.py with a set of pandas operations.
  4. Assign each person to a tax unit. Essentially what's in pycps.py, but taking a person-level DataFrame as input and then also adding an explicit mapping from each person to a tax unit ID (RECID?).
  5. Do tax-unit-level processing. e.g. a lot of the aggregation in taxunit.py could be done instead with a person.groupby('tax_unit_id') type statement, then merging that back to the tax unit DataFrame.
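Here's a rough sketch of what steps 1, 2, and 5 might look like; the colspecs, file name, and wages column are placeholders rather than the real ASEC layout:

import pandas as pd

# Step 1 sketch: read a fixed-width DAT file with pd.read_fwf. The colspecs
# would be built from the (start, end) pairs in master_cps_dict.pkl; the
# positions and file name below are placeholders.
colspecs = [(0, 5), (286, 294)]  # zero-based, end-exclusive
person = pd.read_fwf("asec_person.dat", colspecs=colspecs,
                     names=["ph_seq", "wages"])

# Step 2 sketch: the household join described above.
# merged = person.merge(hh[hh.h_hhtype == 1],
#                       left_on="ph_seq", right_on="h_seq")

# Step 5 sketch: once each person carries a tax_unit_id, the taxunit.py
# aggregation collapses to a groupby plus a merge back.
# unit_totals = person.groupby("tax_unit_id")["wages"].sum()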

This all goes beyond this PR, but given my challenges running poverty analyses with taxcalc, I'd love to see the connection to raw ASEC become simpler, especially as the 2020 ASEC will be released in a couple of months.

@andersonfrailey
Collaborator Author

@MaxGhenis, it definitely is possible. I actually did initially convert the DAT files into DataFrames and did almost everything with pandas operations up until tax unit formation. The reason I switched was speed. I found that when I started the CPS off as a DataFrame, grouped it by household, and then iterated through each household, it took around 20 minutes per file to create the tax units. When I simply kept everything as a list of lists of dictionaries, I got runtime down to the 30 seconds or so we have now.

The biggest choke point came when, after grouping all of the households, I would convert them to lists of dictionaries so that it was easier to wrap my head around how we were forming the tax units within each household. I'm sure there is a way to avoid that conversion and use pandas operations for all the searching/grouping we need, but that was well beyond what I knew about pandas at the time I started the conversion.
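For anyone following along, the two shapes being compared look roughly like this; the names and the stand-in function are made up for illustration, not the actual taxdata code:

# Slow path described above: group a DataFrame and iterate per household.
# for h_seq, house_df in cps.groupby("h_seq"):
#     form_tax_units(house_df)   # hypothetical per-household step

# Fast path: each household is a plain list of person dictionaries, so the
# tax-unit logic is ordinary Python loops with no per-group pandas overhead.
def count_adults(house):
    # stand-in for the real tax-unit formation logic
    return sum(p["a_age"] >= 18 for p in house)

households = [
    [{"a_age": 40, "wsal_val": 52000}, {"a_age": 9, "wsal_val": 0}],
]
print([count_adults(h) for h in households])   # -> [1]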

This is all just a long-winded way of saying: yes, we could structure the inputs as DataFrames, but it would either cost us speed or require refactoring all of the tax unit creation functions to use pandas operations. Not having to wait on NBER to release SAS scripts for each CPS file would be great, though, so I'm open to discussing how we could do this in a separate issue.

@andersonfrailey
Collaborator Author

@MaxGhenis, I'm going to go ahead and merge this PR since it resolves issue #342. But I think we should continue the conversation about accepting different file formats for the CPS in a separate issue. It'd be a great feature to add to the repo.

@andersonfrailey andersonfrailey merged commit 1fdf095 into PSLmodels:master Jul 28, 2020
@andersonfrailey andersonfrailey deleted the dictcps branch July 28, 2020 18:24
Linked issue: Parse CPS dat files using file representation of column mapping (by parsing SAS script or data dictionary) (#342)