EDGAR10-Q Dataset

This dataset is built from 10-Q and 10-K documents (quarterly and annual reports) that publicly listed companies file with the SEC. To access these documents, follow this link. See sample.csv for an example instance from the dataset. To get the CIK of an organization, use CIK_lookup in the content folder.

Data Fields

The data fields are the same among all splits.

text: a string in the form of entity plus sentence.
label: a string describing the relevant context for the entity in the sentence.
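
As a minimal sketch of how these two fields might be read, the snippet below parses a CSV with `text` and `label` columns. The column layout, the example row, and the way the entity is prepended to the sentence are all assumptions made for illustration; check sample.csv for the actual format.

```python
import csv
import io

# Hypothetical two-column CSV; the real sample.csv may be laid out differently.
# The "entity plus sentence" concatenation shown here is only a guess.
EXAMPLE_CSV = """text,label
"15,302 The Company also derecognized existing deferred rent liabilities of $15,302.",Rent Expense
"""


def read_instances(handle):
    """Yield (text, label) pairs from a CSV that has 'text' and 'label' columns."""
    for row in csv.DictReader(handle):
        yield row["text"], row["label"]


instances = list(read_instances(io.StringIO(EXAMPLE_CSV)))
for text, label in instances:
    print(f"{label!r} <- {text!r}")
```

To read a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory example.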

Data Splits

The dataset is split into train, validation, and test sets. The sizes of the splits are as follows:

	Train	Validation	Test
Instances	1,498,995	187,383	187,383
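
The sizes above work out to roughly an 80/10/10 split. The short sketch below recomputes the total and the per-split fractions from the numbers in the table; the variable and function names are illustrative only.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {"train": 1_498_995, "validation": 187_383, "test": 187_383}


def split_fractions(sizes):
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_instances = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(f"total: {total_instances:,}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```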

Building Dataset

Running the script dataset_generation_and_baseline.py pulls the data from the SEC website and stores it in the content folder. cik_lookup.xlsx lists the 2000 organizations whose data was pulled. The script also runs the baseline approach and stores the results in each organization's Excel file.
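
For orientation, here is a hedged sketch of how a CIK can be turned into a request against the SEC's public EDGAR submissions endpoint. It is not taken from dataset_generation_and_baseline.py, which may retrieve filings in a different way. The helper names, the CIK value, and the contact string are placeholders; the sketch only builds the request and does not send it.

```python
import urllib.request


def submissions_url(cik):
    """Build the EDGAR submissions URL; CIKs are zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def build_request(cik, contact):
    """Prepare (but do not send) a request that identifies the caller to the SEC."""
    return urllib.request.Request(
        submissions_url(cik),
        headers={"User-Agent": contact},
    )


# Placeholder CIK and contact details, for illustration only.
request = build_request("1234567", "Example Name example@example.com")
print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would return the filing index for that organization as JSON, subject to the SEC's rate limits.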

ChatGPT response generation

Once the dataset has been created, the baseline approach has been executed, and the Excel files are complete, use the script chatgpt_responses.py to get the results from ChatGPT. Use your own API key to run it.
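
One way to supply your own key without hard-coding it is to read it from an environment variable. The sketch below assumes the variable is called `OPENAI_API_KEY`; whether chatgpt_responses.py reads the key this way is not stated here, so treat the helper as hypothetical.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your own API key.")
    return key


# Usage (after `export OPENAI_API_KEY=...` in the shell):
# api_key = load_api_key()
```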

Table 1: Instance of the Dataset.

Sentence	Value	Entity type	Label for each entity
As of August 5, 2019, there were 46,662,179 shares of common stock, $0.01 par value, outstanding.	4,66,62,179	CARDINAL	Shares Outstanding
The Company also derecognized existing deferred rent liabilities of $15,302.	15,302	MONEY	Rent Expense
The intangible assets acquired have a weighted average useful life of approximately nine years.	nine years	DATE	Intangible assets
The initial purchase price of $31,676 included $30,176 cash consideration paid upon acquisition, funded primarily through borrowings under the Senior Credit Facility, and a contingent earn out payment of up to $25,000 with an estimated fair value of $1,500 as of the acquisition date.	31,676	MONEY	Initial purchase price
The initial purchase price of $31,676 included $30,176 cash consideration paid upon acquisition, funded primarily through borrowings under the Senior Credit Facility, and a contingent earn out payment of up to $25,000 with an estimated fair value of $1,500 as of the acquisition date.	30,176	MONEY	Payments to Acquire Businesses
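
Note that the value in the first row of Table 1 (4,66,62,179) is grouped differently from the number in its sentence (46,662,179). The sketch below shows one way to check that a value still occurs in its sentence when comma grouping is ignored. It is an illustrative helper, not part of the repository's scripts.

```python
import re


def normalize_number(text):
    """Drop thousands separators so differently grouped digits compare equal."""
    return text.replace(",", "")


def value_in_sentence(value, sentence):
    """Return True if the value occurs in the sentence, ignoring comma grouping."""
    if value in sentence:
        return True
    # Collect every run of digits and commas, then compare without the commas.
    numbers = {normalize_number(m) for m in re.findall(r"\d[\d,]*", sentence)}
    return normalize_number(value) in numbers


sentence = (
    "As of August 5, 2019, there were 46,662,179 shares of common stock, "
    "$0.01 par value, outstanding."
)
print(value_in_sentence("4,66,62,179", sentence))
```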

Table 2: Statistics about the dataset.

Results for baseline algorithm, ChatGPT responses and supervised learning models

Supervised Finetuning

Use supervised_deepspeed_finetuning.sh to finetune any model on the EDGAR10-Q dataset.

Results on Downstream datasets

EDGAR-T5-Large was finetuned on several downstream datasets, where it scores higher than T5 Large on each. BloombergGPT 50B was used as the baseline.

Dataset	BloombergGPT 50B	T5 Large	EDGAR-T5-Large
FiQA SA	75.07	74.89	80.42
FPB	51.07	55.77	79.69
Headline	82.20	90.55	93.55
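
To make the comparison concrete, the sketch below recomputes the score differences between EDGAR-T5-Large and the other two models from the table above. The metric behind each score is not stated here, so the differences are reported simply as points; the names used in the code are illustrative.

```python
# Scores copied from the results table above.
SCORES = {
    "FiQA SA": {"BloombergGPT 50B": 75.07, "T5 Large": 74.89, "EDGAR-T5-Large": 80.42},
    "FPB": {"BloombergGPT 50B": 51.07, "T5 Large": 55.77, "EDGAR-T5-Large": 79.69},
    "Headline": {"BloombergGPT 50B": 82.20, "T5 Large": 90.55, "EDGAR-T5-Large": 93.55},
}


def gains(scores, model, reference):
    """Return the per-dataset score difference (model minus reference)."""
    return {
        dataset: round(row[model] - row[reference], 2)
        for dataset, row in scores.items()
    }


gain_over_t5 = gains(SCORES, "EDGAR-T5-Large", "T5 Large")
gain_over_bloomberg = gains(SCORES, "EDGAR-T5-Large", "BloombergGPT 50B")

for dataset in SCORES:
    print(
        f"{dataset}: +{gain_over_t5[dataset]:.2f} vs T5 Large, "
        f"+{gain_over_bloomberg[dataset]:.2f} vs BloombergGPT 50B"
    )
```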

