Refactor & update readme
nmd2k committed May 4, 2023
1 parent b56498b commit 61c770d
Showing 21 changed files with 184 additions and 3,143 deletions.
177 changes: 93 additions & 84 deletions README.md
<img src="https://avatars.githubusercontent.com/u/115590550?s=200&v=4" width="220px" alt="logo">
</p>

**The Vault: Open source parallel data extractor**
__________________________


<!-- Badge start -->
<!-- Badge end -->
</div>

# Relevant Links
[The Vault paper](https://arxiv.org) | [The Vault on HuggingFace datasets](https://huggingface.co/datasets?search) <img alt="Hugging Face Datasets" src="https://img.shields.io/badge/-%F0%9F%A4%97%20datasets-blue">

__________________
# Table of Contents
1. [The Vault](#the-vault-dataset)
    i. [Data Summary](#data-summary)
    ii. [Data Structure](#data-structure)
    iii. [Data Split](#data-split)
2. [The Vault Toolkit](#the-vault-toolkit)
    i. [Getting Started](#getting-started)
    ii. [Processing Pipeline](#processing-pipeline)
    iii. [Processing Custom Dataset](#processing-custom-dataset)

___________
# The Vault Dataset
## Data Summary
The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.

We design The Vault to extract code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, rich metadata, and 11 docstring styles for enhanced usability and versatility.

![The Vault poster](./assets/Poster_The%20Vault.jpg)
## Data Structure
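
The full field list for each extraction level is documented in the [Data Fields](#data-fields) section of `data/format/README.md` (reproduced later on this page). As a rough sketch, with hypothetical values rather than real data, a function-level sample looks like:

```python
# A hypothetical function-level sample (illustrative values, not real data);
# see the Data Fields section below for the full schema
sample = {
    "repo": "owner/project",
    "path": "src/utils/math.py",
    "identifier": "sum2num",
    "language": "Python",
    "license": "MIT",
    "code": "def sum2num(a, b):\n    return a + b",
    "docstring": "Sum of 2 numbers",
    "short_docstring": "Sum of 2 numbers",
    "parameters": {"a": None, "b": None},
    "docstring_params": {"params": [], "outlier_params": [], "returns": [], "raises": []},
}
```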

## Data Split

## Load dataset
Our dataset is available on the Hugging Face datasets hub and can be loaded as follows:

```python
!pip install datasets

from datasets import load_dataset

# Load full function dataset (40M samples)
ds = load_dataset("NamCyan/thevault", split="function")

# Load function "small" trainset (or "medium", "large")
ds = load_dataset("NamCyan/thevault", split="function/train_small")

# Load only function testset
ds = load_dataset("NamCyan/thevault", split="function/test")

# specific language (e.g. Golang)
ds = load_dataset("NamCyan/thevault", split="function/train", languages=['Go'])

# streaming load (that will only download the data as needed)
ds = load_dataset("NamCyan/thevault", split="function/train", streaming=True)
```
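
With `streaming=True`, records are downloaded lazily as you iterate. A minimal sketch of consuming the stream (the split name matches the examples above; the printed field names follow the Data Fields section of this repo):

```python
from datasets import load_dataset

# Stream the function-level training split instead of downloading it fully
ds = load_dataset("NamCyan/thevault", split="function/train", streaming=True)

# Peek at the first record
sample = next(iter(ds))
print(sample["language"], sample["identifier"])
```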
# The Vault Toolkit
## Getting Started

To set up the environment and install dependencies via `pip`:
```bash
pip install -r requirements.txt
```

Install the `codetext` parser, which extracts code using [tree-sitter](https://tree-sitter.github.io/tree-sitter/), via `pip`:
```bash
pip install codetext
```

Or manually build `codetext` from source; see the [`codetext` repo](https://github.com/FSoft-AI4Code/CodeText-parser) for more details:
```bash
git clone https://github.com/FSoft-AI4Code/CodeText-parser.git
cd CodeText-parser
pip install -e .
```

## Processing Pipeline

### Extracting raw code

Parse code into a `tree_sitter.Tree`:
```python
from codetext.utils import parse_code

raw_code = """
/**
* Sum of 2 numbers
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
    return a + b;
}
"""
root = parse_code(raw_code, 'cpp')
root_node = root.root_node
```

### Filtering extracted code snippet
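
Extracted snippets are filtered before they enter the dataset. As a hedged illustration only (the helper name and thresholds below are hypothetical, not the toolkit's built-in rules), a function-level filter might look like:

```python
def keep_snippet(sample: dict) -> bool:
    """Toy quality filter over an extracted function-level sample."""
    code = sample.get("code", "")
    docstring = sample.get("docstring", "")

    # Require a non-empty docstring
    if not docstring.strip():
        return False
    # Reject trivially short or extremely long functions (illustrative bounds)
    n_lines = code.count("\n") + 1
    if n_lines < 3 or n_lines > 500:
        return False
    # Reject docstrings that are mostly non-ASCII (rough English-only heuristic)
    if sum(c.isascii() for c in docstring) / len(docstring) < 0.9:
        return False
    return True
```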
### Processing Custom Dataset
We use a `.yaml` file to define which fields to load when processing data. Usually only the source code is needed, but any additional information about the raw code can also be mapped in via the `.yaml`.

For example, `CodeSearchNet` stores its data in the following structure:
```yaml
# CodeSearchNet jsonline format
# https://github.com/github/CodeSearchNet#data-details

code: original_string # raw code
repo: repo # additional info
path: path # additional info
language: language # additional info
```
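
A minimal sketch of what such a mapping means in practice (the file paths here are hypothetical, and this is not the toolkit's internal loader, just an illustration of the field remapping):

```python
import json
import yaml  # pyyaml

# Load the field mapping: toolkit field name -> raw-dataset field name
with open("data/format/csn-format.yaml") as f:
    mapping = yaml.safe_load(f)

# Remap one raw CodeSearchNet-style record into the toolkit's fields
with open("raw_data.jsonl") as f:  # hypothetical input file
    raw = json.loads(f.readline())

sample = {ours: raw[theirs] for ours, theirs in mapping.items()}
print(sample["code"][:80])
```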
```bash
python -m codetext.processing
...
options:
  ...
  --debug
```

### Analyse and split dataset
The processing step saves cleaned samples in batches; you can merge them using `postprocess.py`. We also provide an analysis tool that reports the total number of samples, blank lines(\*), comments(\*) and code lines(\*). You can also split your dataset into `train`, `valid`, and `test` sets.

```bash
python -m codetext.postprocessing
<DATASET_PATH> # path to dir containing /extracted, /filtered, /raw
--save_path <SAVE_PATH> # path to save final output

--n_core 10 # number of cores for multiprocessing analyzer
--analyze # Analyze trigger
--split # Split train/test/valid trigger
--ratio 0.05 # Test and valid ratio (default to equal)
--max_sample 20000 # Max size of test set and valid set
```
## Technical Report and Citing the Vault
More details can be found in our [technical report](https://arxiv.org/abs/).

<!-- If you're using The Vault or the toolkit in your research or applications, please cite using this BibTeX:
```bibtex
@misc{,
title={},
author={},
year={2022},
eprint={},
archivePrefix={},
primaryClass={}
}
```-->

## Contact us
If you have any questions, comments or suggestions, please do not hesitate to contact us at [email].

## License
[MIT License](LICENSE.txt)
Binary file added assets/Poster_The Vault.jpg
50 changes: 48 additions & 2 deletions data/format/README.md
# Data Fields

These are the data fields of the samples produced by the extraction functions, organized by extraction level.
## Function and Class level

- **repo** the owner/repo's name
- **path** the full path to the original file
- **identifier** the function/method's name
- **license** repo's license
- **stars_count** number of the repo's stars (nullable)
- **issues_count** number of the repo's issues (nullable)
- **forks_count** number of the repo's forks (nullable)
- **original_string** original code snippet version of function/class node
- **original_docstring** the raw string before tokenization or parsing
- **language** the source programming language
- **code** the part of the "original_string" that is code
- **code_tokens** tokenized version of "code"
- **docstring** the top-level comment or docstring (docstring version without param’s doc, return, exception, etc)
- **docstring_tokens** tokenized version of `docstring`
- **short_docstring** first line of the "docstring"
- **short_docstring_tokens** tokenized version of "short_docstring"
- **comment** list of comments (lines) inside the function/class node
- **parameters** Dict of parameter `identifier` and its `type` (`type` is nullable)
- **docstring_params**
  - **params** list of dictionaries for the params described inside the "docstring". Fields: `identifier`, `docstring`, `docstring_tokens`, `type` (nullable), `default` (nullable), `is_optional` (nullable).
  - **outlier_params** list of params which don't belong to the function declaration*. The structure is the same as "params".
  - **returns** list of returns. Fields: `type`, `docstring`, `docstring_tokens`.
  - **raises** list of raises/throws. Fields: `type`, `docstring`, `docstring_tokens`.
  - **others** list of other docstring param types (e.g. `version`, `author`, etc.). The field's name will be equal to its type's key.

*Note on outlier params: for example, given `def cal_sum(a, b):`, if a param `c` is described in the docstring, then `c` is an outlier param.*
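
To make this concrete, here is a hedged sketch (a hypothetical sample, not taken from the dataset) of a function whose docstring would produce both regular and outlier param entries:

```python
def cal_sum(a, b):
    """Sum of two numbers.

    :param a: first addend           -> docstring_params["params"]
    :param b: second addend          -> docstring_params["params"]
    :param c: not in the declaration -> docstring_params["outlier_params"]
    :return: the sum of a and b      -> docstring_params["returns"]
    """
    return a + b
```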


## Inline level

- **repo** the owner/repo
- **path** full path to the original file
- **language** the programming language
- **license** repo's license
- **parent_name** name of the method/class parent node
- **original_string** full version of code snippet
- **code** the part of "original_string" that is code
- **code_tokens** tokenized version of code
- **prev_context** the (code) block above the comment
- **next_context** the (code) block below the comment
- **start_point** position of start line, position of start character
- **end_point** position of last line, position of last character
- **original_comment** the original comment before cleaning
- **comment** the cleaned comment
- **comment_tokens** tokenized version of comment
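
As a hedged illustration (a hypothetical record, not real data), an inline-level sample for a one-line comment might look like:

```python
# Hypothetical inline-level sample for the snippet:
#   total = 0
#   # accumulate the running sum
#   total += x
sample = {
    "language": "Python",
    "parent_name": "running_total",  # enclosing function (assumed)
    "prev_context": "total = 0",
    "original_comment": "# accumulate the running sum",
    "comment": "accumulate the running sum",
    "comment_tokens": ["accumulate", "the", "running", "sum"],
    "next_context": "total += x",
    "start_point": (1, 0),           # (line, character) where the comment starts
    "end_point": (1, 28),
}
```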
2 changes: 0 additions & 2 deletions data/format/csn-format.yaml
code: original_string
repo: repo
path: path
language: language

3 changes: 0 additions & 3 deletions requirements.txt
bs4
yaml
tqdm
nltk

2 changes: 2 additions & 0 deletions src/README.md
# Processing Pipeline
