This Python script fetches and processes financial disclosure reports from the website of the United States House of Representatives. The primary goal of this tool is to automate the retrieval and preliminary processing of these reports, making them easily accessible for further analysis.
Here is a step-by-step breakdown of the script's functionality:
- Importing required libraries:
  - Utilizes the following libraries: `datetime`, `wget`, `pandas`, `zipfile`, `xmlutils`, and `os.path`.
- Getting the current date and year:
  - Determines the current year and date using the `datetime` library.
- Downloading the ZIP file:
  - Uses the `wget` library to fetch the ZIP file containing the financial disclosure reports for the present year directly from the U.S. House of Representatives' website.
- Extracting the XML file:
  - Employs the `zipfile` library to extract the XML content from the recently downloaded ZIP file.
- Converting XML to CSV:
  - Makes use of the `xmlutils` library to transform the XML file into a more manageable CSV format.
- Getting the last 10 days' data and sorting the list:
  - Uses the `pandas` library to:
    - Load the CSV data.
    - Filter records from the past 40 days (despite the "last 10 days" name, the script actually extracts the last 40 days of data).
    - Sort these records by their filing date in descending order.
    - Save this processed data into a new CSV file.
- Getting the document IDs:
  - Extracts document IDs from the sorted CSV, storing them in a list for easy access.
- Downloading the PDF files:
  - Uses the `wget` library again to download the PDF file for each document ID.
  - Downloads a PDF only if it does not already exist on the user's local machine, so nothing is fetched twice.
- Cleaning up files:
  - Deletes any intermediate files created during the process. This includes the initial ZIP file, XML content, and all generated CSV files, ensuring a clean workspace post-execution.
- Completion Notification:
  - Upon successful execution, the script prints a "DONE!!!" message.
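
The extract, filter, sort, and skip-if-present steps above can be sketched roughly as follows. This is not the script's own code: it swaps `xmlutils` and `pandas` for the standard library so that it runs without extra installs, and it skips the network download entirely. The field names `FilingDate` and `DocID`, the `M/D/YYYY` date format, and the function names are assumptions made for illustration; check them against the actual XML before relying on them.

```python
import os
import xml.etree.ElementTree as ET
import zipfile
from datetime import date, datetime, timedelta

# Assumed field names and date format; verify against the real XML file.
DATE_FIELD = "FilingDate"
ID_FIELD = "DocID"
DATE_FORMAT = "%m/%d/%Y"


def extract_records(zip_path):
    """Read the first XML member of the ZIP and return one dict per record."""
    with zipfile.ZipFile(zip_path) as archive:
        xml_names = [n for n in archive.namelist() if n.lower().endswith(".xml")]
        if not xml_names:
            raise ValueError(f"no XML file found in {zip_path}")
        with archive.open(xml_names[0]) as handle:
            root = ET.parse(handle).getroot()
    # Each child of the root is treated as a record; its children are fields.
    return [
        {field.tag: (field.text or "").strip() for field in record}
        for record in root
    ]


def recent_records(records, today=None, days=40):
    """Keep records filed within the last `days` days, newest first."""
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    dated = []
    for record in records:
        filed = datetime.strptime(record[DATE_FIELD], DATE_FORMAT).date()
        if filed >= cutoff:
            dated.append((filed, record))
    dated.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in dated]


def missing_pdfs(doc_ids, directory):
    """Return the document IDs whose PDF is not yet present in `directory`."""
    return [
        doc_id
        for doc_id in doc_ids
        if not os.path.exists(os.path.join(directory, f"{doc_id}.pdf"))
    ]
```

A caller would pass the IDs returned by `missing_pdfs` to whatever download routine it uses, which keeps the "only download what is not already on disk" rule in one place.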
- Ensure you have all the mentioned libraries installed in your Python environment.
- Clone or download the script from this GitHub repository.
- Execute the script in your preferred Python environment or terminal.
- Once the script completes, you should have the desired PDFs downloaded in the specified directory.
- Always ensure you have the required storage space available, especially if running this script frequently, as financial disclosure reports can be sizable.
- Regularly check the structure of the U.S. House of Representatives' website. If the site layout or the way reports are stored changes, the script might need adjustments.
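
One way to notice such a change early is to validate the parsed records before processing them. The sketch below is a hypothetical helper, not part of the script; the field names `FilingDate` and `DocID` are assumptions and should be replaced with whatever the data actually contains.

```python
# Assumed field names; adjust to match the real data.
REQUIRED_FIELDS = ("FilingDate", "DocID")


def check_fields(records, required=REQUIRED_FIELDS):
    """Raise a descriptive error if the records do not look as expected.

    `records` is a list of dicts, one per report. Returns the record count
    when every record carries all required fields.
    """
    if not records:
        raise ValueError("no records found; the source format may have changed")
    for index, record in enumerate(records):
        missing = [name for name in required if name not in record]
        if missing:
            raise KeyError(
                f"record {index} is missing fields: {', '.join(missing)}"
            )
    return len(records)
```

Failing fast with a clear message is usually easier to debug than a later error deep inside the filtering or download steps.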