Refactor & update readme
nmd2k committed May 4, 2023
1 parent b56498b commit 61c770d
Showing 21 changed files with 184 additions and 3,143 deletions.
177 changes: 93 additions & 84 deletions README.md
<img src="https://avatars.githubusercontent.com/u/115590550?s=200&v=4" width="220px" alt="logo">
</p>

**The Vault: Open source parallel data extractor**
__________________________


<!-- Badge start -->
<!-- Badge end -->
</div>

# Relevant Links
[The Vault paper](https://arxiv.org) | [The Vault on HuggingFace datasets](https://huggingface.co/datasets?search) <img alt="Hugging Face Datasets" src="https://img.shields.io/badge/-%F0%9F%A4%97%20datasets-blue">

__________________
# Table of Contents
1. [The Vault](#the-vault-dataset)
    i. [Data Summary](#data-summary)
    ii. [Data Structure](#data-structure)
    iii. [Data Split](#data-split)
2. [The Vault Toolkit](#the-vault-toolkit)
    i. [Getting Started](#getting-started)
    ii. [Processing Pipeline](#processing-pipeline)
    iii. [Processing Custom Dataset](#processing-custom-dataset)

___________
# The Vault Dataset
## Data Summary
The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.

We design The Vault to extract code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, rich metadata, and 11 docstring styles for enhanced usability and versatility.

![The Vault poster](./assets/Poster_The%20Vault.jpg)
## Data Structure
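
The full field list for each extraction level is documented in the [Data Fields](#data-fields) section of `data/format/README.md` (reproduced later on this page). As a rough sketch, with hypothetical values rather than real data, a function-level sample looks like:

```python
# A hypothetical function-level sample (illustrative values, not real data);
# see the Data Fields section below for the full schema
sample = {
    "repo": "owner/project",
    "path": "src/utils/math.py",
    "identifier": "sum2num",
    "language": "Python",
    "license": "MIT",
    "code": "def sum2num(a, b):\n    return a + b",
    "docstring": "Sum of 2 numbers",
    "short_docstring": "Sum of 2 numbers",
    "parameters": {"a": None, "b": None},
    "docstring_params": {"params": [], "outlier_params": [], "returns": [], "raises": []},
}
```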

## Data Split

## Load dataset
Our dataset is available on the Hugging Face datasets hub and can be loaded as follows:

```python
!pip install datasets

from datasets import load_dataset

# Load full function dataset (40M samples)
ds = load_dataset("NamCyan/thevault", split="function")

# Load function "small" trainset (or "medium", "large")
ds = load_dataset("NamCyan/thevault", split="function/train_small")

# Load only function testset
ds = load_dataset("NamCyan/thevault", split="function/test")

# specific language (e.g. Golang)
ds = load_dataset("NamCyan/thevault", split="function/train", languages=['Go'])

# streaming load (that will only download the data as needed)
ds = load_dataset("NamCyan/thevault", split="function/train", streaming=True)
```
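
With `streaming=True`, records are downloaded lazily as you iterate. A minimal sketch of consuming the stream (the split name matches the examples above; the printed field names follow the Data Fields section of this repo):

```python
from datasets import load_dataset

# Stream the function-level training split instead of downloading it fully
ds = load_dataset("NamCyan/thevault", split="function/train", streaming=True)

# Peek at the first record
sample = next(iter(ds))
print(sample["language"], sample["identifier"])
```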
# The Vault Toolkit
## Getting Started

To set up the environment and install dependencies via `pip`:
```bash
pip install -r requirements.txt
```

Install the `codetext` parser, which extracts code using [tree-sitter](https://tree-sitter.github.io/tree-sitter/), via `pip`:
```bash
pip install codetext
```

Or manually build `codetext` from source; see the [`codetext` repo](https://github.com/FSoft-AI4Code/CodeText-parser) for more details:
```bash
git clone https://github.com/FSoft-AI4Code/CodeText-parser.git
cd CodeText-parser
pip install -e .
```

## Processing Pipeline

### Extracting raw code

Parse code into a `tree_sitter.Tree`:
```python
from codetext.utils import parse_code

raw_code = """
/**
* Sum of 2 numbers
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
    return a + b;
}
"""
root = parse_code(raw_code, 'cpp')
root_node = root.root_node
```

### Filtering extracted code snippet
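
Extracted snippets are filtered before they enter the dataset. As a hedged illustration only (the helper name and thresholds below are hypothetical, not the toolkit's built-in rules), a function-level filter might look like:

```python
def keep_snippet(sample: dict) -> bool:
    """Toy quality filter over an extracted function-level sample."""
    code = sample.get("code", "")
    docstring = sample.get("docstring", "")

    # Require a non-empty docstring
    if not docstring.strip():
        return False
    # Reject trivially short or extremely long functions (illustrative bounds)
    n_lines = code.count("\n") + 1
    if n_lines < 3 or n_lines > 500:
        return False
    # Reject docstrings that are mostly non-ASCII (rough English-only heuristic)
    if sum(c.isascii() for c in docstring) / len(docstring) < 0.9:
        return False
    return True
```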
### Processing Custom Dataset
We use a `.yaml` file to define which fields to load when processing data. Usually only the source code is needed, but any additional information about the raw code can also be mapped in via the `.yaml`.

For example, `CodeSearchNet` stores its data in the following structure:
```yaml
# CodeSearchNet jsonline format
# https://github.com/github/CodeSearchNet#data-details

code: original_string # raw code
repo: repo # additional info
path: path # additional info
language: language # additional info
```
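
A minimal sketch of what such a mapping means in practice (the file paths here are hypothetical, and this is not the toolkit's internal loader, just an illustration of the field remapping):

```python
import json
import yaml  # pyyaml

# Load the field mapping: toolkit field name -> raw-dataset field name
with open("data/format/csn-format.yaml") as f:
    mapping = yaml.safe_load(f)

# Remap one raw CodeSearchNet-style record into the toolkit's fields
with open("raw_data.jsonl") as f:  # hypothetical input file
    raw = json.loads(f.readline())

sample = {ours: raw[theirs] for ours, theirs in mapping.items()}
print(sample["code"][:80])
```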
```bash
python -m codetext.processing
...
options:
  ...
  --debug
```

### Analyse and split dataset
The processing step saves cleaned samples in batches; you can merge them using `postprocess.py`. We also provide an analysis tool that reports the total number of samples, blank lines(\*), comments(\*) and code lines(\*). You can also split your dataset into `train`, `valid`, and `test` sets.

```bash
python -m codetext.postprocessing
<DATASET_PATH> # path to dir containing /extracted, /filtered, /raw
--save_path <SAVE_PATH> # path to save final output

--n_core 10 # number of cores for multiprocessing analyzer
--analyze # Analyze trigger
--split # Split train/test/valid trigger
--ratio 0.05 # Test and valid ratio (default to equal)
--max_sample 20000 # Max size of test set and valid set
```
## Technical Report and Citing the Vault
More details can be found in our [technical report](https://arxiv.org/abs/).

<!-- If you're using The Vault or the toolkit in your research or applications, please cite using this BibTeX:
```bibtex
@misc{,
title={},
author={},
year={2022},
eprint={},
archivePrefix={},
primaryClass={}
}
```-->

## Contact us
If you have any questions, comments or suggestions, please do not hesitate to contact us at [email].

## License
[MIT License](LICENSE.txt)
Binary file added assets/Poster_The Vault.jpg
50 changes: 48 additions & 2 deletions data/format/README.md
# Data Fields

These are the data fields of the samples produced by the extraction functions, organized by extraction level.
## Function and Class level

- **repo** the owner/repo's name
- **path** the full path to the original file
- **identifier** the function/method's name
- **license** repo's license
- **stars_count** number of the repo's stars (nullable)
- **issues_count** number of the repo's issues (nullable)
- **forks_count** number of the repo's forks (nullable)
- **original_string** original code snippet version of function/class node
- **original_docstring** the raw string before tokenization or parsing
- **language** the source programming language
- **code** the part of the "original_string" that is code
- **code_tokens** tokenized version of "code"
- **docstring** the top-level comment or docstring (docstring version without param’s doc, return, exception, etc)
- **docstring_tokens** tokenized version of `docstring`
- **short_docstring** first line of the "docstring"
- **short_docstring_tokens** tokenized version of "short_docstring"
- **comment** list of comments (lines) inside the function/class node
- **parameters** Dict of parameter `identifier` and its `type` (`type` is nullable)
- **docstring_params**
  - **params** list of dictionaries for the params described inside the "docstring". Fields: `identifier`, `docstring`, `docstring_tokens`, `type` (nullable), `default` (nullable), `is_optional` (nullable).
  - **outlier_params** list of params which don't belong to the function declaration*. The structure is the same as "params".
  - **returns** list of returns. Fields: `type`, `docstring`, `docstring_tokens`.
  - **raises** list of raises/throws. Fields: `type`, `docstring`, `docstring_tokens`.
  - **others** list of other docstring param types (e.g. `version`, `author`, etc.). The field's name will be equal to its type's key.

*Note on outlier params: for example, given `def cal_sum(a, b):`, if a param `c` is described in the docstring, then `c` is an outlier param.*
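
To make this concrete, here is a hedged sketch (a hypothetical sample, not taken from the dataset) of a function whose docstring would produce both regular and outlier param entries:

```python
def cal_sum(a, b):
    """Sum of two numbers.

    :param a: first addend           -> docstring_params["params"]
    :param b: second addend          -> docstring_params["params"]
    :param c: not in the declaration -> docstring_params["outlier_params"]
    :return: the sum of a and b      -> docstring_params["returns"]
    """
    return a + b
```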


## Inline level

- **repo** the owner/repo
- **path** full path to the original file
- **language** the programming language
- **license** repo's license
- **parent_name** name of the method/class parent node
- **original_string** full version of code snippet
- **code** the part of "original_string" that is code
- **code_tokens** tokenized version of code
- **prev_context** the (code) block above the comment
- **next_context** the (code) block below the comment
- **start_point** position of start line, position of start character
- **end_point** position of last line, position of last character
- **original_comment** the original comment before cleaning
- **comment** the cleaned comment
- **comment_tokens** tokenized version of comment
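
As a hedged illustration (a hypothetical record, not real data), an inline-level sample for a one-line comment might look like:

```python
# Hypothetical inline-level sample for the snippet:
#   total = 0
#   # accumulate the running sum
#   total += x
sample = {
    "language": "Python",
    "parent_name": "running_total",  # enclosing function (assumed)
    "prev_context": "total = 0",
    "original_comment": "# accumulate the running sum",
    "comment": "accumulate the running sum",
    "comment_tokens": ["accumulate", "the", "running", "sum"],
    "next_context": "total += x",
    "start_point": (1, 0),           # (line, character) where the comment starts
    "end_point": (1, 28),
}
```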
2 changes: 0 additions & 2 deletions data/format/csn-format.yaml
code: original_string
repo: repo
path: path
language: language

3 changes: 0 additions & 3 deletions requirements.txt
bs4
yaml
tqdm
nltk

2 changes: 2 additions & 0 deletions src/README.md
# Processing Pipeline
