MLRL Datasets

This directory contains dataset processing instructions for the MLRL experiments.

arXiv Abstract

This dataset is derived from arXiv scientific papers and is used for the text expansion task, which is based on the paper abstracts.

Raw Data

Long and structured documents of arXiv scientific papers (Download).

The files are in JSON Lines format: each line is a JSON object corresponding to one scientific paper. The abstract, sections, and body are all sentence-tokenized. Each JSON object has the following format:

{ 
  'article_id': str,
  'abstract_text': List[str],
  'article_text': List[str],
  'section_names': List[str],
  'sections': List[List[str]]
}
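
As a minimal sketch of how files in this format could be read, the snippet below parses one JSON object per line and skips blank lines. The function name `iter_papers` and the example file name are hypothetical and are not taken from `create_arxiv_abstract.py`.

```python
import json
from typing import Iterator


def iter_papers(path: str) -> Iterator[dict]:
    """Yield one paper (a dict) per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def abstract_as_text(paper: dict) -> str:
    """Join the sentence-tokenized abstract back into a single string."""
    return " ".join(paper["abstract_text"])
```

For example, `for paper in iter_papers("raw_papers.jsonl"): print(paper["article_id"])` would walk through a file one paper at a time without loading it fully into memory (the file name here is a placeholder).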

Processed Dataset

The processed dataset is in the same JSON Lines format as above, but keeps only the article_id and abstract_text fields and adds a token_count field. Only abstracts of 100-300 tokens are kept. Each JSON object has the following format:

{ 
  'article_id': str,
  'abstract_text': List[str],
  'token_count': int,
}
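
The conversion from a raw record to a processed record might look roughly like the sketch below. It rests on two assumptions that the text above does not confirm: tokens are counted by whitespace splitting (the actual script may use a different tokenizer), and the 100-300 bounds are inclusive. The names `count_tokens` and `process_paper` are hypothetical.

```python
from typing import List, Optional


def count_tokens(sentences: List[str]) -> int:
    # ASSUMPTION: whitespace tokenization; the real tokenizer is not specified.
    return sum(len(sentence.split()) for sentence in sentences)


def process_paper(
    paper: dict, min_tokens: int = 100, max_tokens: int = 300
) -> Optional[dict]:
    """Return the processed record, or None if the abstract is out of range."""
    n = count_tokens(paper["abstract_text"])
    # ASSUMPTION: both bounds are inclusive.
    if not (min_tokens <= n <= max_tokens):
        return None
    return {
        "article_id": paper["article_id"],
        "abstract_text": paper["abstract_text"],
        "token_count": n,
    }
```

Records for which `process_paper` returns `None` would simply be skipped when writing the output file.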

The processed dataset is available on Hugging Face as OpenMLRL/arXiv_abstract.

CoopHumanEval

This dataset is based on the HumanEval benchmark and is used for the code generation task.
