
Conversation

@andersonfrailey
Collaborator

This PR addresses issue #342. Rather than having separate cpsmar201X.py files for each year of the CPS, we now have a single file, cpsmar.py, that uses a dictionary containing all of the information on where each variable is located in the raw DAT file. This dictionary is stored in master_cps_dict.pkl. Its general structure is:

{
    cps_year: {
        "household": {"variable": (start_position, end_position, decimals)},
        "family": {"variable": (start_position, end_position, decimals)},
        "person": {"variable": (start_position, end_position, decimals)}
    }
}

start_position is the character in the string where the variable starts
end_position is the character immediately after the variable ends
decimals is the number of implied decimal places for floating-point variables. For example, in the SAS script that we parse to obtain all of this information, the line for hsup_wgt is

@287  hsup_wgt        8.2

This implies that the variable hsup_wgt starts at character 287 (286 with zero-based indexing), takes up the next 8 characters, and has two implied decimal places. Most variables have no implied decimals and are interpreted as integers. Those that do have implied decimals are converted to floats as we parse the file.
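For illustration, here is a minimal sketch of how one of these tuples could be applied to a raw record. The extract helper is hypothetical, not the actual cpsmar.py code, and it assumes the stored positions are one-based like the @287 in the SAS line:

def extract(line, start, end, decimals):
    # Hypothetical helper, not the actual cpsmar.py implementation.
    # Assumes (start, end) positions are one-based, matching SAS @-notation.
    raw = line[start - 1:end - 1]
    value = int(raw)
    # decimals=2 turns the 8-character field "00012345" into 123.45;
    # decimals=0 fields stay integers.
    return value / 10 ** decimals if decimals else value

record = " " * 286 + "00012345"       # fabricated record: hsup_wgt field only
print(extract(record, 287, 295, 2))   # -> 123.45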

These changes have no effect on the final CPS file produced.

cc @MaxGhenis

@MaxGhenis
Contributor


Thanks for making this code clearer and simpler, @andersonfrailey!

@MaxGhenis
Contributor

MaxGhenis commented Jul 21, 2020

I gave this a whirl with 2018 data, and it worked great.

Might make sense as a separate PR, but I'm wondering if this might be an opportunity to also prepare for CSVs in future ASECs, assuming the pattern in 2019 continues (#346). I was looking at using the 2019 CSVs, but had trouble because some of the processing occurs in the data loading step.

Would it be possible to break these steps out? Something like this could be cleaner and more pandas-oriented (a rough sketch of a few of these steps follows the list):

  1. Load raw data as DataFrames. The current process could probably be further simplified by using pd.read_fwf (I'd expect it to be faster, too). For 2019+, it would just be loading the person CSV (I'm not seeing a need for the family and household records, given the person records include all identifiers, but let me know if I'm missing something).
  2. Merge person with family and household records (if needed). Again, not sure this is necessary, but if it is, it could be done by first merging on person.ph_seq = hh[hh.h_hhtype == 1].h_seq and then person_hh.ph_seq = family.fh_seq and person_hh.phf_seq = family.ffpos, per this SAS code.
  3. Do person-level processing and C-TAM merging. e.g. replace person_details() in cpsmar.py with a set of pandas operations.
  4. Assign each person to a tax unit. Essentially what's in pycps.py, but taking a person-level DataFrame as input and then also adding an explicit mapping from each person to a tax unit ID (RECID?).
  5. Do tax-unit-level processing. e.g. a lot of the aggregation in taxunit.py could be done instead with a person.groupby('tax_unit_id') type statement, then merging that back to the tax unit DataFrame.
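Here's a rough sketch of what steps 1, 2, and 5 might look like; the colspecs, file name, and wages column are placeholders rather than the real ASEC layout:

import pandas as pd

# Step 1 sketch: read a fixed-width DAT file with pd.read_fwf. The colspecs
# would be built from the (start, end) pairs in master_cps_dict.pkl; the
# positions and file name below are placeholders.
colspecs = [(0, 5), (286, 294)]  # zero-based, end-exclusive
person = pd.read_fwf("asec_person.dat", colspecs=colspecs,
                     names=["ph_seq", "wages"])

# Step 2 sketch: the household join described above.
# merged = person.merge(hh[hh.h_hhtype == 1],
#                       left_on="ph_seq", right_on="h_seq")

# Step 5 sketch: once each person carries a tax_unit_id, the taxunit.py
# aggregation collapses to a groupby plus a merge back.
# unit_totals = person.groupby("tax_unit_id")["wages"].sum()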

This all goes beyond this PR, but given my challenges running poverty analyses with taxcalc, I'd love to see the connection to raw ASEC become simpler, especially as the 2020 ASEC will be released in a couple of months.

@andersonfrailey
Collaborator Author

@MaxGhenis, it definitely is possible. I actually did initially convert the DAT files into DataFrames and did almost everything with pandas operations up until tax unit formation. The reason I switched was speed. I found that when I started the CPS off as a DataFrame, grouped it by household, and then iterated through each household, it took around 20 minutes per file to create the tax units. When I simply kept everything as a list of lists of dictionaries, I got runtime down to the 30 seconds or so we have now.

The biggest choke point came when, after grouping all of the households, I would convert them to lists of dictionaries so that it was easier to wrap my head around how we were forming the tax units within each household. I'm sure there is a way to avoid that conversion and use pandas operations for all the searching/grouping we need, but that was well beyond what I knew about pandas at the time I started the conversion.
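For anyone following along, the two shapes being compared look roughly like this; the names and the stand-in function are made up for illustration, not the actual taxdata code:

# Slow path described above: group a DataFrame and iterate per household.
# for h_seq, house_df in cps.groupby("h_seq"):
#     form_tax_units(house_df)   # hypothetical per-household step

# Fast path: each household is a plain list of person dictionaries, so the
# tax-unit logic is ordinary Python loops with no per-group pandas overhead.
def count_adults(house):
    # stand-in for the real tax-unit formation logic
    return sum(p["a_age"] >= 18 for p in house)

households = [
    [{"a_age": 40, "wsal_val": 52000}, {"a_age": 9, "wsal_val": 0}],
]
print([count_adults(h) for h in households])   # -> [1]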

This is all just a long-winded way of saying: yes, we could structure the inputs as DataFrames, but it would either cost us speed or require refactoring all of the tax unit creation functions to use pandas operations. Not having to wait on NBER to release SAS scripts for each CPS file would be great, though, so I'm open to discussing how we could do this in a separate issue.

@andersonfrailey
Collaborator Author

@MaxGhenis, I'm going to go ahead and merge this PR since it resolves issue #342. But I think we should continue the conversation about accepting different file formats for the CPS in a separate issue. It'd be a great feature to add to the repo.

@andersonfrailey andersonfrailey merged commit 1fdf095 into PSLmodels:master Jul 28, 2020
@andersonfrailey andersonfrailey deleted the dictcps branch July 28, 2020 18:24
Linked issue: Parse CPS dat files using file representation of column mapping (by parsing SAS script or data dictionary) (#342)