Description
EDGAR publishes a "full" index, that is, an index of all filings in the current quarter, which `pandas-datareader` now supports. Once a quarter ends, its indices are moved to an archive folder and presented as daily indices. So this enhancement will support pulling together a historical index for those of us who would like to work with historical filings; I'm currently working on building it.
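For a sense of what that involves, here is a minimal sketch of pulling a single quarter of the full index by hand. The URL layout and the pipe-delimited `master.idx` format come from EDGAR's public archive; the example quarter, the User-Agent string, and the preamble length are assumptions for illustration, not anything this enhancement would be committed to.

```python
import io
import urllib.request

import pandas as pd

# Hypothetical example quarter; a historical index would loop over every
# year/quarter in the requested date range.
URL = "https://www.sec.gov/Archives/edgar/full-index/2015/QTR1/master.idx"

# SEC asks automated clients to identify themselves via the User-Agent header.
req = urllib.request.Request(URL, headers={"User-Agent": "example-user example@example.com"})
raw = urllib.request.urlopen(req).read().decode("latin-1")

# master.idx is pipe-delimited after a short preamble; skiprows=11 matches the
# preamble length in the files I have looked at, but treat it as an assumption.
index_df = pd.read_csv(
    io.StringIO(raw),
    sep="|",
    skiprows=11,
    names=["cik", "company_name", "form_type", "date_filed", "filename"],
)
print(index_df.head())
```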
At the moment, I'm not planning on building in document retrieval directly, but I do want to make that reasonably easy. In my mind, a workflow would look like this:
- Use `pandas-datareader` to pull the EDGAR index for the time period of interest. (This is where `pandas-datareader`'s role ends.)
- Use a list of CIK identifiers, particular filing types, or another dataset (via merge) to filter down the list of documents that you'd like to retrieve (see the sketch after this list).
- Use `wget`, `curl`, or whatever to pull the documents.
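To make the filtering step concrete, here is a rough sketch that continues from the `index_df` above. The column names (`cik`, `form_type`) and the example CIK/form lists are assumptions for illustration, not a settled API:

```python
# Hypothetical CIKs and filing types to keep; in practice these might come
# from another dataset merged in on CIK.
ciks_of_interest = [320193, 789019]
forms_of_interest = ["10-K", "10-Q"]

filtered = index_df[
    index_df["cik"].isin(ciks_of_interest)
    & index_df["form_type"].isin(forms_of_interest)
]

# Equivalently, filter by merging against another DataFrame keyed on CIK:
# filtered = index_df.merge(other_df, on="cik", how="inner")
```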
When I say "make it reasonably easy," I mean I'm doing things like making the filename paths in the returned index consistent, so that you can simply concatenate the server name and the path to get a full link. In older data, the directory paths are missing a directory (presumably because they hadn't named it "EDGAR" yet).
My thinking on keeping document retrieval out is that these should be one-time pulls that we shouldn't cache, and a dedicated tool for pulling (potentially hundreds of thousands of) documents should perform far better than our readily available options. Still, it should be trivial to create a download list from the index we return.
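Creating that download list might look something like the following, assuming the index carries a `filename` column holding server-relative paths (with the normalization mentioned above already applied) and that those paths hang off `https://www.sec.gov/Archives/`:

```python
# Concatenate the server name and the (now consistent) path to get full links,
# then write a plain-text download list that wget or curl can consume.
SERVER = "https://www.sec.gov/Archives/"  # assumed base for the index's relative paths

urls = SERVER + filtered["filename"].str.lstrip("/")
urls.to_csv("edgar_downloads.txt", index=False, header=False)

# Outside Python, something like:
#   wget --input-file=edgar_downloads.txt --wait=1
```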
Any thoughts?