Add code for collecting reginfo.gov data to the data update Python scripts #36

Open
zhoudanxie opened this issue Feb 23, 2024 · 8 comments

@zhoudanxie
Collaborator

Data for the significant & economically significant rules prior to 2021 were collected from the Regulatory Review database on Reginfo.gov. The Python scripts for updating the data need to be revised to pull data from Reginfo.gov.

@zhoudanxie zhoudanxie added the enhancement New feature or request label Feb 23, 2024
@zhoudanxie zhoudanxie self-assigned this Feb 23, 2024
@zhoudanxie
Collaborator Author

The Reg Stats data for significant & economically significant rules prior to 2021 were obtained from the Regulatory Review database on reginfo.gov. I investigated the possibility of pulling the data from the Regulatory Review XML reports, but there appear to be discrepancies between the data in the XMLs and the web search results.

For example, for the presidential year 2017 (Published Date Range = 02/01/2017-01/31/2018), the web search returns 77 significant rules and 22 economically significant rules, whereas the XMLs show only 40 significant rules and 14 economically significant rules published during the same time period. This suggests that the XML reports were not updated as publication dates became available. Therefore, the same data cannot be obtained from the XML reports.
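For reference, the presidential-year convention used above (Feb 1 through Jan 31 of the following year) can be sketched as below. This is only an illustration; the function names are hypothetical and not from the existing update scripts.

```python
from datetime import date


def presidential_year(published: date) -> int:
    """Map a publication date to its presidential year (Feb 1 through Jan 31)."""
    # A January date belongs to the presidential year that began the previous Feb 1.
    return published.year if published.month >= 2 else published.year - 1


def count_by_presidential_year(dates):
    """Tally publication dates into annual counts keyed by presidential year."""
    counts = {}
    for d in dates:
        year = presidential_year(d)
        counts[year] = counts.get(year, 0) + 1
    return counts
```

With this convention, both 02/01/2017 and 01/31/2018 fall in presidential year 2017, matching the date range in the example.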

The web search requires manual input of criteria, and the URL of the results page does not contain criteria values. I have no idea how to automate this process.

I also investigated the Federal Register API as an alternative source for this data. While incomplete and sometimes inaccurate, the rin_priority field from the FR indicates the Unified Agenda significance designation for published documents. However, the FR rules include corrections, extensions of comment periods, etc., so the numbers returned are much larger than what we got from reginfo.gov.
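A minimal sketch of what such an FR API query could look like, assuming the public `documents.json` endpoint with its `conditions[...]` and `fields[]` parameters. The selection of fields, in particular `regulation_id_number_info` as the source of the RIN priority, is an assumption and should be checked against the API documentation.

```python
from urllib.parse import urlencode

FR_API_URL = "https://www.federalregister.gov/api/v1/documents.json"


def build_fr_query(start: str, end: str, per_page: int = 1000) -> str:
    """Build a query URL for final rules published between start and end (YYYY-MM-DD)."""
    params = [
        ("conditions[type][]", "RULE"),
        ("conditions[publication_date][gte]", start),
        ("conditions[publication_date][lte]", end),
        # Fields requested here are an assumption about what the update script needs.
        ("fields[]", "document_number"),
        ("fields[]", "publication_date"),
        ("fields[]", "title"),
        ("fields[]", "action"),
        ("fields[]", "regulation_id_number_info"),
        ("per_page", str(per_page)),
    ]
    # urlencode percent-encodes the brackets, which the API should still accept.
    return f"{FR_API_URL}?{urlencode(params)}"
```

For presidential year 2017 this would be called as `build_fr_query("2017-02-01", "2018-01-31")`; fetching the URL and paging through the results is left out of the sketch.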

In sum, I didn't find a good approach to automatically fetch and verify data for significant & economically significant rules published prior to 2021. Any thoughts @mfebrizio @haysarah ?

@mfebrizio
Collaborator

mfebrizio commented Jun 11, 2024

Thanks, Zoey. This is really helpful. It sounds like we should stick with the manual process in the meantime, but I have a few ideas:

  1. Automate filling out the search criteria on Reginfo. I think this might be possible with Python libraries that allow you to fill out web forms. Requests and Selenium both have such options iirc. Selenium is commonly used for test automation, which I think has similarities to what we're doing here.
  2. Use the FR API rin_priority field but filter out corrections, extensions, etc. This is similar to a script I wrote for a different project. I can share that and get feedback on using that as a stopgap.
  3. Build a significant rules classifier for FR documents. This is an idea I've had for a while, and it would take a lot of time and effort, but it would be the most scalable approach.
  4. Have one or more RAs work through the FR data in the same manner Dylan has, but just going back in time.
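As a rough illustration of option 2, a keyword filter over FR document metadata might look like the sketch below. The pattern list is hypothetical and incomplete (it only covers the corrections and comment-period extensions mentioned in this thread), and it assumes each document is a dict with `title` and `action` keys.

```python
import re

# Hypothetical patterns; these would need tuning against the reginfo.gov counts.
NON_RULE_PATTERN = re.compile(
    r"\b(correction|correcting amendment|extension of (the )?comment period)",
    re.IGNORECASE,
)


def is_substantive_rule(doc: dict) -> bool:
    """Return False when the title or action marks a correction or extension."""
    # Treat missing or None values as empty strings before searching.
    text = " ".join(str(doc.get(key) or "") for key in ("title", "action"))
    return NON_RULE_PATTERN.search(text) is None


def filter_substantive(docs):
    """Keep only documents that do not look like corrections or extensions."""
    return [doc for doc in docs if is_substantive_rule(doc)]
```

A text filter like this will miss edge cases, so its counts would still need to be compared against the manual reginfo.gov numbers.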

@mfebrizio
Collaborator

> the web search returns 77 significant rules and 22 economically significant rules, whereas the XMLs show only 40 significant rules and 14 economically significant rules published during the same time period. This suggests that the XML reports were not updated as publication dates became available.

One more thought: we should def reach out to reginfo and notify them of the inconsistency between the xml and search requests.

@zhoudanxie
Collaborator Author

Thanks Mark! These are all good thoughts. Your idea 1 sounds promising; I'll explore the libraries that auto-fill web forms. Option 2 is also worth trying. If you can share your code for cleaning FR documents, I'll check how closely its output matches the reginfo.gov data.

Options 3 & 4 are longer-term solutions. Manually going through all final rules back in time may be too time-consuming: our data go back to 1981, and we only need annual counts for the current Reg Stats charts, so rule-level details matter less. In that sense, option 3 may be a more efficient approach.

I'll send an email to the reginfo.gov contact about the data inconsistency and see if they do anything.

@mfebrizio
Collaborator

mfebrizio commented Jun 12, 2024

Re: option 2, I just added you to the repo with that code. Should be here.

If I am remembering correctly, it uses data output from your Unified Agenda compilation script, extracts the FR citations from the "actions" columns, then for remaining documents links the RIN to the UA RIN.
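The citation-extraction step could be sketched roughly as follows. This is not the code from that repo; it assumes the "actions" columns hold free text with citations in the usual "82 FR 12345" form, and the function name is hypothetical.

```python
import re

# Matches citations such as "82 FR 12345" (volume, then starting page).
FR_CITATION = re.compile(r"\b(\d{1,3})\s*FR\s*(\d{1,6})\b")


def extract_fr_citations(actions_text):
    """Return (volume, page) tuples for every FR citation found in the text."""
    # Empty or missing cells yield an empty list.
    return [
        (int(volume), int(page))
        for volume, page in FR_CITATION.findall(actions_text or "")
    ]
```

Documents without a matched citation would then fall through to the RIN-based linking step described above.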

@mfebrizio
Collaborator

> In that sense, option 3 may be a more efficient approach.

And more fun :)

@zhoudanxie
Collaborator Author

zhoudanxie commented Jun 12, 2024

> Re: option 2, I just added you to the repo with that code. Should be here.
>
> If I am remembering correctly, it uses data output from your Unified Agenda compilation script, extracts the FR citations from the "actions" columns, then for remaining documents links the RIN to the UA RIN.

Thanks! Does the rin_priority field from your fr-toolbelt come from the same source? I used that and thought it was obtained from the FR API.

@mfebrizio
Collaborator

mfebrizio commented Jun 12, 2024

No, that's right: fr-toolbelt uses a field from the API, "regulation_id_number_info." I think it was less complete than linking the FR documents to the UA data, but I don't remember off the top of my head.
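A minimal sketch of reading a priority designation out of that API field, not the actual fr-toolbelt implementation. The field name `regulation_id_number_info`, its shape (a dict keyed by RIN), and the `priority_category` key are all assumptions about the API response that should be verified.

```python
def extract_rin_priority(doc: dict) -> list:
    """Collect priority designations from a document's RIN metadata, if any."""
    # Assumed shape: {"<RIN>": {"priority_category": "...", ...}, ...}
    info = doc.get("regulation_id_number_info") or {}
    priorities = []
    for rin, details in info.items():
        # Skip RINs whose metadata is missing or not a dict.
        if isinstance(details, dict) and details.get("priority_category"):
            priorities.append(details["priority_category"])
    return priorities
```

Documents with no RIN metadata return an empty list, which is one place where the incompleteness mentioned above would show up.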
