This directory contains dataset processing instructions for the MLRL experiments.
This dataset is derived from arXiv scientific papers and is used for the text expansion task, which takes the paper abstracts as input.
The source data consists of long, structured arXiv scientific papers (Download).
The files are in JSON Lines format, where each line is a JSON object corresponding to one scientific paper. The abstract, sections, and body are all sentence-tokenized. Each JSON object has the following format:
{
    'article_id': str,
    'abstract_text': List[str],
    'article_text': List[str],
    'section_names': List[str],
    'sections': List[List[str]]
}
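As a minimal sketch of how such a file could be read, the Python snippet below parses one JSON object per line and skips blank lines. The function names `parse_jsonl_lines` and `read_jsonl` are hypothetical helpers introduced here for illustration; they are not part of this repository.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


def read_jsonl(path: str) -> Iterator[dict]:
    """Stream records from a JSON Lines file without loading it all at once."""
    with open(path, "r", encoding="utf-8") as f:
        yield from parse_jsonl_lines(f)
```

Streaming line by line keeps memory use low, which matters here because each record carries the full article text.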
The processed dataset uses the same JSON Lines format as above, but keeps only the article_id and abstract_text fields and adds a token_count field. Only abstracts with a length of 100-300 tokens are included. Each JSON object has the following format:
{
    'article_id': str,
    'abstract_text': List[str],
    'token_count': int
}
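The filtering step described above could look roughly like the following sketch. It rests on two assumptions that this README does not confirm: tokens are counted by whitespace splitting (the tokenizer actually used for the released dataset may differ), and the 100-300 range is inclusive at both ends. The names `count_tokens` and `process_record` are hypothetical.

```python
from typing import List, Optional

MIN_TOKENS = 100  # assumed inclusive lower bound
MAX_TOKENS = 300  # assumed inclusive upper bound


def count_tokens(sentences: List[str]) -> int:
    """Count tokens across all abstract sentences.

    ASSUMPTION: whitespace tokenization; the real pipeline may use
    a different tokenizer and produce different counts.
    """
    return sum(len(sentence.split()) for sentence in sentences)


def process_record(record: dict) -> Optional[dict]:
    """Return the reduced record, or None if the abstract is out of range."""
    n = count_tokens(record["abstract_text"])
    if not (MIN_TOKENS <= n <= MAX_TOKENS):
        return None
    # Keep only the fields listed in the processed format.
    return {
        "article_id": record["article_id"],
        "abstract_text": record["abstract_text"],
        "token_count": n,
    }
```

Each surviving record can then be written back out with `json.dumps(record)` followed by a newline to produce the processed JSON Lines file.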
The processed dataset is available on Hugging Face as OpenMLRL/arXiv_abstract.
This dataset is based on the HumanEval benchmark and is used for the code generation task.