newfacade/LeetCodeDataset

LeetCodeDataset

LeetCodeDataset is a dataset comprising Python LeetCode problems designed for training and evaluating Large Language Models (LLMs).

💻 Hugging Face Datasets 📄 Paper

Data Fields

The dataset adheres to the human-eval problem file format.

  • task_id: The LeetCode problem's question title slug, which corresponds to the problem URL.
  • question_id: The LeetCode problem's question ID.
  • difficulty: The problem's difficulty level (Easy, Medium, or Hard).
  • tags: E.g. ['Array', 'Hash Table']
  • problem_description: The problem description, including examples and constraints.
  • starter_code: The starter code to solve the problem.
  • estimated_date: The estimated release date.
  • prompt: The prefix for the completion, such as basic imports.
  • completion: The completion without the prompt.
  • entry_point: The function name used for evaluation.
  • test: A function to check test cases.
  • input_output: Test cases.
  • query: The query including problem description and starter code.
  • response: The correct response.
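
To illustrate how these fields fit together, here is a minimal sketch that assembles one record into a runnable program in the human-eval style. The record is a hand-written toy, not taken from the dataset. Two things are assumptions on my part and should be checked against the actual data: that test defines a check(candidate) function, and that entry_point is an expression such as Solution().twoSum.

```python
# Toy record with a subset of the fields described above (illustrative values).
record = {
    "task_id": "two-sum",
    "prompt": "from typing import List\n",
    "completion": (
        "class Solution:\n"
        "    def twoSum(self, nums: List[int], target: int) -> List[int]:\n"
        "        seen = {}\n"
        "        for i, x in enumerate(nums):\n"
        "            if target - x in seen:\n"
        "                return [seen[target - x], i]\n"
        "            seen[x] = i\n"
        "        return []\n"
    ),
    # Assumption: entry_point is an expression evaluating to the callable.
    "entry_point": "Solution().twoSum",
    # Assumption: test defines check(candidate), as in human-eval.
    "test": (
        "def check(candidate):\n"
        "    assert candidate(nums=[2, 7, 11, 15], target=9) == [0, 1]\n"
        "    assert candidate(nums=[3, 2, 4], target=6) == [1, 2]\n"
    ),
}


def build_program(rec: dict) -> str:
    """Concatenate prompt, completion, test, and the final check call."""
    return (
        rec["prompt"] + rec["completion"] + "\n"
        + rec["test"] + "\n"
        + f"check({rec['entry_point']})\n"
    )


def passes(rec: dict) -> bool:
    """Run the assembled program; True if no assertion or error is raised.

    NOTE: exec runs arbitrary code. Use a sandbox for untrusted predictions.
    """
    try:
        exec(build_program(rec), {"__name__": "__lcd_sketch__"})
        return True
    except Exception:
        return False
```

Calling passes(record) on the toy record returns True; replacing completion with a wrong solution makes it return False.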

Training

LeetCodeDataset can be used for training as follows:

  1. The dataset is split into training and test sets. Problems are ordered by question_id, with those having larger question_id values used for the test set.
  2. Fine-tune the LLM on the training split, using the query field as the input and the response field as the target.
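
The two steps above can be sketched as follows. This is an illustration only: the exact cutoff between training and test problems is not stated here, so n_test is a hypothetical parameter, and both function names are mine rather than part of the repository.

```python
from typing import Dict, List, Tuple


def split_by_question_id(
    records: List[dict], n_test: int
) -> Tuple[List[dict], List[dict]]:
    """Put the n_test records with the largest question_id in the test set."""
    ordered = sorted(records, key=lambda r: int(r["question_id"]))
    if n_test <= 0:
        return ordered, []
    return ordered[:-n_test], ordered[-n_test:]


def to_sft_pairs(train: List[dict]) -> List[Dict[str, str]]:
    """Map each training record to a query/response pair for fine-tuning."""
    return [{"query": r["query"], "response": r["response"]} for r in train]


# Usage with toy records (not real dataset rows).
toy = [
    {"question_id": i, "query": f"q{i}", "response": f"r{i}"}
    for i in (5, 1, 3, 2, 4)
]
train, test = split_by_question_id(toy, n_test=2)
pairs = to_sft_pairs(train)
```

The resulting pairs can then be passed to whichever fine-tuning framework you use.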

The number of problems in each version and split is as follows:

Version   Train   Test
v0.1.0    1570    175
v0.2.0    1890    200
v0.3.0    2386    386
v0.3.1    2641    228

Evaluation

Installation

git clone https://github.com/newfacade/LeetCodeDataset
cd LeetCodeDataset
pip install -e .

LeetCodeDataset Evaluation Example

eval_lcd --version v0.3.1 \
         --split test \
         --input_file ./data/LeetCodeDataset-v0.3.1-test.jsonl \
         --predict_column completion

Explanation of Parameters

  • version: dataset version.
  • split: test or train.
  • input_file: A JSONL file with one record per problem from the specified version and split. Each record must contain task_id and the prediction.
  • predict_column: The name of the column in input_file that holds the prediction. For example, if the records look like {'task_id': 'two_sum', 'output': 'To solve the problem of finding two indices ...'}, pass --predict_column output.
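
As a rough sketch of how such an input file could be produced, the helper below writes one JSON object per line with task_id and the model output. It assumes, per the parameter descriptions above, that task_id plus the prediction column is sufficient. The names write_predictions and generate are hypothetical; generate stands in for your own model call.

```python
import json
from typing import Callable, Iterable


def write_predictions(
    problems: Iterable[dict],
    generate: Callable[[str], str],
    out_path: str,
    predict_column: str = "output",
) -> int:
    """Write one JSON object per line with task_id and the model output.

    Returns the number of rows written.
    """
    n = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for p in problems:
            row = {
                "task_id": p["task_id"],
                predict_column: generate(p["query"]),
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            n += 1
    return n
```

The resulting file would then be passed to eval_lcd via --input_file, with --predict_column set to the same column name (output in this sketch).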

You can also perform custom evaluations using the evaluate_functional_correctness command, which is consistent with human-eval.

Data Curation

  1. Metadata Acquisition, including:
    • question id: unique numeric identifier
    • question: URL-related string (serves as the primary task id)
    • problem description
    • starter code
  2. Canonical Solution Verification
    • Retrieved reference solutions from GitHub open-source datasets
    • Validated solution correctness through LeetCode’s official execution environment
  3. Entry Point Identification: Implemented text pattern matching to detect target functions
  4. Test Case Generation
  5. Automated Evaluation Framework
    • Developed a sandboxed execution environment for safe code evaluation
    • Implemented a trial-and-error mechanism to execute canonical solutions against generated inputs
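
The entry point identification in step 3 could look something like the sketch below. The patterns actually used for curation are not given here, so these regular expressions and the find_entry_point name are hypothetical. The sketch assumes starter code of the usual LeetCode shape: a class with at least one method taking self.

```python
import re
from typing import Optional

# Hypothetical patterns; the ones used for curation are not documented here.
_CLASS_RE = re.compile(r"^class\s+(\w+)\b", re.MULTILINE)
# First indented method taking self whose name does not start with "__".
_METHOD_RE = re.compile(r"^\s+def\s+(?!__)(\w+)\s*\(\s*self\b", re.MULTILINE)


def find_entry_point(starter_code: str) -> Optional[str]:
    """Return e.g. 'Solution().twoSum' for the first public method found."""
    cls = _CLASS_RE.search(starter_code)
    if cls is None:
        return None
    method = _METHOD_RE.search(starter_code, cls.end())
    if method is None:
        return None
    return f"{cls.group(1)}().{method.group(1)}"


starter = (
    "class Solution:\n"
    "    def twoSum(self, nums: List[int], target: int) -> List[int]:\n"
    "        "
)
entry = find_entry_point(starter)
```

For the starter code above, entry is the string Solution().twoSum. A pattern-based approach like this is simple but would need extra handling for design-style problems with several public methods.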

Paper/blog/projects Using LeetCodeDataset

Citation

@misc{xia2025leetcodedatasettemporaldatasetrobust,
      title={LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs}, 
      author={Yunhui Xia and Wei Shen and Yan Wang and Jason Klein Liu and Huifeng Sun and Siyue Wu and Jian Hu and Xiaolong Xu},
      year={2025},
      eprint={2504.14655},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.14655}, 
}

🙏 Acknowledgment
