This dataset is built from 10-Q and 10-K filings (quarterly and annual reports) that publicly listed companies submit to the SEC. To access these documents, follow this link. See sample.csv for an example instance from the dataset. To get the CIK of an organization, use the CIK_lookup file in the contents folder.
The data fields are the same among all splits.
- text: a string in the form of entity plus sentence.
- label: a string describing the relevant context for the entity in the sentence.
The dataset is split into train, validation, and test sets. The sizes of the splits are as follows:
| | Train | Validation | Test |
|---|---|---|---|
| Instances | 1,498,995 | 187,383 | 187,383 |
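The reported counts correspond to roughly an 80/10/10 partition. A minimal sketch that checks the proportions from the numbers in the split table (the variable names are illustrative and not part of the repository):

```python
# Instance counts as reported in the split table.
split_sizes = {"train": 1_498_995, "validation": 187_383, "test": 187_383}

total = sum(split_sizes.values())

# Fraction of the whole dataset that falls into each split.
split_fractions = {name: size / total for name, size in split_sizes.items()}

for name, fraction in split_fractions.items():
    print(f"{name}: {split_sizes[name]:,} instances ({fraction:.1%})")
```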
Running the script dataset_generation_and_baseline.py pulls the data from the SEC website and stores it in the content folder. cik_lookup.xlsx lists the 2000 organizations whose data was pulled. The script also runs the baseline approach and stores the results in each organization's Excel file.
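To illustrate how filings can be located for a single organization, the sketch below builds the URL of SEC EDGAR's public submissions endpoint from a CIK. It is not taken from dataset_generation_and_baseline.py, which may fetch the data differently; the helper name `submissions_url` and the example CIK are assumptions made for this example. EDGAR expects the CIK zero-padded to 10 digits.

```python
from typing import Union


def submissions_url(cik: Union[int, str]) -> str:
    """Return the EDGAR submissions URL for a CIK (zero-padded to 10 digits)."""
    # Drop surrounding whitespace and any existing zero padding before re-padding.
    digits = str(cik).strip().lstrip("0") or "0"
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return f"https://data.sec.gov/submissions/CIK{digits.zfill(10)}.json"


# Example CIK chosen for illustration only (320193 is Apple Inc.).
print(submissions_url(320193))
```

When requesting such URLs, note that the SEC asks automated clients to identify themselves with a descriptive User-Agent header and to keep request rates low.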
Once the dataset is created, the baseline approach has been executed, and the Excel files are complete, use the script chatgpt_responses.py to get the results from ChatGPT. Use your own API key to run it.
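One way to supply your own key without hard-coding it is to read it from an environment variable. The sketch below is an assumption about usage, not a description of how chatgpt_responses.py handles the key; the variable name `OPENAI_API_KEY` and the helper `load_api_key` are illustrative.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"set the {env_var} environment variable before running")
    return key
```

Calling `load_api_key()` at the start of a run surfaces a missing key immediately rather than partway through a long batch of requests.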
Table 1: Sample instances from the dataset.
Sentence | Value | Entity type | Label |
---|---|---|---|
As of August 5, 2019, there were 46,662,179 shares of common stock, $0.01 par value, outstanding. | 46,662,179 | CARDINAL | Shares Outstanding |
The Company also derecognized existing deferred rent liabilities of $15,302. | 15,302 | MONEY | Rent Expense |
The intangible assets acquired have a weighted average useful life of approximately nine years. | nine years | DATE | Intangible assets |
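The initial purchase price of $31,676 included $30,176 cash consideration paid upon acquisition, funded primarily through borrowings under the Senior Credit Facility, and a contingent earn out payment of up to $25,000 with an estimated fair value of $1,500 as of the acquisition date. | 31,676 | MONEY | Initial purchase price |
The initial purchase price of $31,676 included $30,176 cash consideration paid upon acquisition, funded primarily through borrowings under the Senior Credit Facility, and a contingent earn out payment of up to $25,000 with an estimated fair value of $1,500 as of the acquisition date. | 30,176 | MONEY | Payments to Acquire Businesses |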
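The sample rows can be held as simple records for quick inspection. The sketch below transcribes the five examples, counts the entity types, and checks that each tagged value occurs verbatim in its sentence. The dictionary keys are illustrative names and are not necessarily the column names used in sample.csv.

```python
from collections import Counter

# One sentence can carry several tagged entities, so it is reused for two rows.
acquisition_sentence = (
    "The initial purchase price of $31,676 included $30,176 cash consideration "
    "paid upon acquisition, funded primarily through borrowings under the Senior "
    "Credit Facility, and a contingent earn out payment of up to $25,000 with an "
    "estimated fair value of $1,500 as of the acquisition date."
)

# (sentence, value, entity type, label) tuples transcribed from the sample table.
raw_rows = [
    ("As of August 5, 2019, there were 46,662,179 shares of common stock, "
     "$0.01 par value, outstanding.", "46,662,179", "CARDINAL", "Shares Outstanding"),
    ("The Company also derecognized existing deferred rent liabilities of $15,302.",
     "15,302", "MONEY", "Rent Expense"),
    ("The intangible assets acquired have a weighted average useful life of "
     "approximately nine years.", "nine years", "DATE", "Intangible assets"),
    (acquisition_sentence, "31,676", "MONEY", "Initial purchase price"),
    (acquisition_sentence, "30,176", "MONEY", "Payments to Acquire Businesses"),
]

# Hypothetical field names chosen for this example.
fields = ("sentence", "value", "entity_type", "label")
rows = [dict(zip(fields, raw)) for raw in raw_rows]

# How often each entity type occurs among the samples.
entity_type_counts = Counter(row["entity_type"] for row in rows)

# Sanity check: every tagged value should appear verbatim in its sentence.
values_found = [row["value"] in row["sentence"] for row in rows]

print(dict(entity_type_counts))
print(all(values_found))
```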
Table 2: Statistics about the dataset.
Use supervised_deepspeed_finetuning.sh to fine-tune any model on the EDGAR10-Q dataset.
EDGAR-T5-Large was fine-tuned on several downstream datasets and obtained better results than T5 Large on each of them. BloombergGPT 50B was used as a baseline.
Dataset | BloombergGPT 50B | T5 Large | EDGAR-T5-Large |
---|---|---|---|
FiQA SA | 75.07 | 74.89 | 80.42 |
FPB | 51.07 | 55.77 | 79.69 |
Headline | 82.20 | 90.55 | 93.55 |
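To make the size of the improvements explicit, the sketch below computes the difference between the EDGAR-tuned T5 Large scores and the two other columns. The numbers are copied from the results table; the metric is not stated here, so the gains are in the table's own units, and the variable names are illustrative.

```python
# Scores copied from the results table, in column order:
# (BloombergGPT 50B, T5 Large, EDGAR-tuned T5 Large).
scores = {
    "FiQA SA": (75.07, 74.89, 80.42),
    "FPB": (51.07, 55.77, 79.69),
    "Headline": (82.20, 90.55, 93.55),
}

gains = {}
for dataset, (bloomberg, t5_large, edgar_t5) in scores.items():
    # Absolute gain of the EDGAR-tuned model over each comparison model.
    gains[dataset] = {
        "vs_t5_large": round(edgar_t5 - t5_large, 2),
        "vs_bloomberg": round(edgar_t5 - bloomberg, 2),
    }
    print(f"{dataset}: {gains[dataset]}")
```

The largest gap in both comparisons is on FPB.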