
Commit 54e136c

committed
major changes involving readme and all the packaging
1 parent 2dd35f1 commit 54e136c

17 files changed

+410
-22
lines changed

.gitignore

Lines changed: 3 additions & 3 deletions
@@ -3,8 +3,6 @@ __pycache__/
33
*.py[cod]
44
*$py.class
55

6-
# C extensions
7-
*.so
86

97
# Distribution / packaging
108
.Python
@@ -135,4 +133,6 @@ dmypy.json
135133

136134
.data/
137135

138-
model/
136+
model/
137+
138+
.python-version.th

.python-version.actual

Lines changed: 0 additions & 1 deletion
This file was deleted.

README.md

Lines changed: 95 additions & 14 deletions
@@ -1,28 +1,109 @@
11
# code-bert
2-
BERT/RoBERTa kind of language representation model training and fine tuning
32

4-
We have released the model [here](https://huggingface.co/codistai/codeBERT-small-v2)
3+
This is [CodistAI](https://codist-ai.com/)'s open-source package for easily using the fine-tuned model built on our open-source MLM code model [codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2).
54

6-
However, this small python module serves as the pre-tokenization step needed for the tokenizer to deal with code.
5+
[codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2) is a RoBERTa model trained with the Hugging Face Transformers library. We then fine-tuned it on the following prediction task:
76

7+
Given a function body `f` as a string of code tokens (including special tokens such as `indent` and `dedent`) and a docstring `d` as a string of natural-language tokens, predict whether `f` and `d` are associated (that is, whether they represent the same concept).
8+
9+
## An example
10+
11+
Let's consider the following code
12+
13+
```python
14+
from pathlib import Path
15+
16+
def get_file(filename):
17+
"""
18+
opens a url
19+
"""
20+
if not Path(filename).is_file():
21+
return None
22+
return open(filename, "rb")
823

924
```
10-
from code_bert.core.data_reader import process_code
1125

12-
with open("test_files/test_code_get.py") as f:
13-
code = f.read()
26+
Using our other open-source library, [tree-hugger](https://github.com/autosoft-dev/tree-hugger), it is fairly trivial to extract the function body and the docstring with a single API call.
27+
28+
We can then use the [`process_code`](https://github.com/autosoft-dev/code-bert/blob/2dd35f16fa2cdb96f75e21bb0a9393aa3164d885/code_bert/core/data_reader.py#L136) method from this repo to process the code lines into the format that [codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2) expects.
1429

30+
Doing the above two steps properly would produce something like the following
1531

16-
process_code(code)
32+
- **Function** - `def get file ( filename ) : indent if not path ( filename ) . is file ( ) : indent return none dedent return open ( filename , "rb" ) dedent`
33+
34+
- **Doc String** - `opens a url`
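
To make the flattened form above concrete, here is a minimal sketch that produces the same kind of token string using only the standard-library `tokenize` module. The function name `flatten_code` is hypothetical, and this is not the repo's `process_code`; it only assumes that the target format is lowercase, splits identifiers on underscores, and marks block structure with `indent` and `dedent`.

```python
import io
import tokenize


def flatten_code(source: str) -> str:
    """Flatten Python source into one space-separated token string.

    Hypothetical helper for illustration; not the repo's process_code.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            out.append("indent")
        elif tok.type == tokenize.DEDENT:
            out.append("dedent")
        elif tok.type in (tokenize.NEWLINE, tokenize.NL,
                          tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # line structure is carried by indent/dedent only
        elif tok.type == tokenize.NAME:
            # Lowercase names and split snake_case into separate words.
            parts = [p for p in tok.string.lower().split("_") if p]
            out.append(" ".join(parts) if parts else tok.string)
        else:
            out.append(tok.string)  # operators, strings, numbers as written
    return " ".join(out)


code = (
    "def get_file(filename):\n"
    "    if not Path(filename).is_file():\n"
    "        return None\n"
    "    return open(filename, \"rb\")\n"
)
flat = flatten_code(code)
print(flat)
```

The docstring is left out of `code` here because the pipeline strips it from the function body before flattening.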
35+
36+
Ideally, we then need a model to run the following pseudocode:
37+
38+
```python
39+
match, confidence = model(function, docstring)
1740
```
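
As a rough illustration of how a `(match, confidence)` pair can come out of a two-class classifier, the sketch below applies a softmax to two raw scores. The function name `decide` and the convention that class index 1 means "match" are assumptions made for this example; the repo's `Prediction` class returns the argmax label and a loss rather than a probability.

```python
import math
from typing import Sequence, Tuple


def decide(logits: Sequence[float]) -> Tuple[bool, float]:
    """Turn two class logits into (match, confidence) with a softmax."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    # Assumption: class index 1 means "function and docstring match".
    return best == 1, probs[best]


match, confidence = decide([-1.2, 2.3])
print(match, round(confidence, 3))
```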
1841

19-
This will produce a result like this
42+
## code-bert CLI
43+
44+
**The entire code base is built for and available on Python 3.6+**
45+
46+
We provide easy-to-use CLI commands to do all of this, at scale. Let's go through them step by step.
47+
48+
**We strongly recommend using a virtual environment for the following steps**
49+
50+
1. First clone this repo - `git clone https://github.com/autosoft-dev/code-bert.git`
51+
52+
2. (Assuming you have the virtualenv activated) Then do `pip install -r requirements.txt`
53+
54+
3. Then install the package with `pip install -e .`
55+
56+
4. Next, download and set up the model. If the steps above completed properly, the `download_model` command is available for this.
57+
58+
5. The model is almost 1.7 GB in total, so the download may take a while to finish.
59+
60+
6. Once this is done, you are ready to analyze code. A CLI command is provided for that as well; it is described in the following section.
61+
62+
-----------
63+
64+
Assuming the model is downloaded and ready, you can run the following command to analyze a single file or a directory of files:
65+
66+
```
67+
usage: run_pipeline [-h] [-f FILE_NAME] [-r RECURSIVE]
68+
69+
optional arguments:
70+
-h, --help show this help message and exit
71+
-f FILE_NAME, --file_name FILE_NAME
72+
The name of the file you want to run the pipeline on
73+
-r RECURSIVE, --recursive RECURSIVE
74+
Put the directory if you want to run recursively
75+
```
76+
77+
So, let's say you have a directory called `test_files` with some Python files in it. This is how you can analyze them:
78+
79+
`run_pipeline -r test_files`
80+
81+
A prompt will appear asking you to confirm the model location. Once you confirm, the pipeline walks the directory recursively and analyzes one file at a time.
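
The recursive walk can be sketched as follows. This is a minimal, standard-library-only illustration that combines the directory walk and the `.py` filter into one generator; the name `iter_python_files` and the temporary directory layout are made up for the example.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator


def iter_python_files(dir_name: str) -> Iterator[str]:
    """Yield the path of every .py file under dir_name, recursively."""
    for root, _, file_names in os.walk(dir_name):
        for name in sorted(file_names):
            if Path(name).suffix == ".py":
                yield os.path.join(root, name)


# Build a small throwaway tree that mirrors the test_files example.
with tempfile.TemporaryDirectory() as tmp:
    inner = Path(tmp) / "inner_dir"
    inner.mkdir()
    (Path(tmp) / "test_code_add.py").write_text("def add(a, b):\n    return a + b\n")
    (Path(tmp) / "notes.txt").write_text("not code\n")
    (inner / "test_code_get.py").write_text("def get_file(filename):\n    return None\n")
    found = sorted(os.path.relpath(p, tmp) for p in iter_python_files(tmp))

print(found)
```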
82+
83+
It should produce a report like the following -
84+
2085

2186
```
22-
Out[4]:
23-
['from pathlib import path',
24-
'def get file ( filename ) : indent',
25-
'if not path ( filename ) . is file ( ) : indent',
26-
'return none',
27-
'dedent return open ( filename , "rb" ) dedent']
87+
======== Analysing test_files/test_code_add.py =========
88+
89+
90+
def add ( a , b ) : indent return a + b dedent
91+
Function "add" with Dcostring """sums two numbers and returns the result"""
92+
Do they match?
93+
Yes
94+
******************************************************************
95+
def return all even ( lst ) : indent if not lst : indent return none dedent return [ a for a in lst if a % 2 == 0 ] dedent
96+
Function "return_all_even" with Dcostring """numbers that are not really odd"""
97+
Do they match?
98+
Yes
99+
******************************************************************
100+
101+
======== Analysing test_files/inner_dir/test_code_get.py =========
102+
103+
104+
def get file ( filename ) : indent if not path ( filename ) . is file ( ) : indent return none dedent return open ( filename , "rb" ) dedent
105+
Function "get_file" with Dcostring """opens a url"""
106+
Do they match?
107+
No
108+
******************************************************************
28109
```

code_bert/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
__version__ = "0.1.5"
1+
__version__ = "0.2.0"

code_bert/cli/run_pipeline.py

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
1+
import logging
2+
import platform
3+
from pathlib import Path
4+
from argparse import ArgumentParser
5+
6+
from .utils import query_yes_no
7+
from ..core.data_preparation import iter_dir, FileParser
8+
from ..core.prediction import Prediction
9+
10+
LIBS = {"Linux": "libs/linux", "Darwin": "libs/darwin"}
11+
12+
logging.disable(logging.INFO)
13+
14+
is_python_file = lambda x: Path(x).suffix == ".py"
15+
16+
17+
def _my_os():
18+
return platform.system()
19+
20+
21+
def _run_model(file_path, file_parser, predictor):
22+
print(f"\n ======== Analysing {file_path} =========\n\n")
23+
for func_name, func_body, docstr in file_parser.parse_file_and_get_data(file_path):
24+
match, _ = predictor.predict(func_body, docstr)
25+
match_yes = "Yes" if match else "No"
26+
print(func_body)
27+
print(f'Function "{func_name}" with Dcostring """{docstr}"""\nDo they match?\n{match_yes}')
28+
print("******************************************************************")
29+
30+
31+
def run_pipeline(args):
32+
if args.file_name and args.recursive:
33+
raise Exception("\n\nCannot specify both a single file and a directory.\n Choose one of them"
34+
)
35+
36+
if not Path("Model").exists() or not Path("Model").is_dir():
37+
raise Exception("\n\nEither the Model directory does not exist or it is invalid")
38+
39+
choice = query_yes_no("We believe that the model is at 'Model' directory. Shall we continue?")
40+
41+
if choice:
42+
print("Loading model")
43+
predictor = Prediction("Model")
44+
print("Model loaded")
45+
os_version = _my_os()
46+
lib = LIBS.get(os_version)
47+
if not lib:
48+
raise Exception(f"\n\nYour version of OS {os_version} is not supported yet!")
49+
lib = f"{lib}/my-languages.so"
50+
query_file = "queries/queries.yml"
51+
52+
fp = FileParser(lib, query_file)
53+
54+
if args.recursive:
55+
for file_path in iter_dir(args.recursive):
56+
if is_python_file(file_path):
57+
_run_model(file_path, fp, predictor)
58+
else:
59+
if is_python_file(args.file_name):
60+
_run_model(args.file_name, fp, predictor)
61+
else:
62+
print("Bye Bye!")
63+
64+
65+
66+
def main():
67+
parser = ArgumentParser()
68+
parser.add_argument("-f", "--file_name", type=str, required=False, help="The name of the file you want to run the pipeline on")
69+
parser.add_argument("-r", "--recursive", required=False, help="Put the directory if you want to run recursively")
70+
71+
args = parser.parse_args()
72+
run_pipeline(args)
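
One possible alternative to the manual check in `run_pipeline` is to let `argparse` enforce that `-f` and `-r` are not combined. The sketch below uses `add_mutually_exclusive_group`, a standard `argparse` feature; the `build_parser` name and the `required=True` choice are assumptions for illustration and would change behavior compared with the commit, where both options are optional.

```python
from argparse import ArgumentParser


def build_parser() -> ArgumentParser:
    """Build a parser where -f and -r cannot be given together."""
    parser = ArgumentParser(prog="run_pipeline")
    # Assumption: exactly one target is required; the commit leaves both optional.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("-f", "--file_name", type=str,
                        help="The name of the file you want to run the pipeline on")
    target.add_argument("-r", "--recursive", type=str,
                        help="Put the directory if you want to run recursively")
    return parser


args = build_parser().parse_args(["-r", "test_files"])
print(args.recursive, args.file_name)
```

With this setup, passing both options makes `argparse` print a usage error and exit, so the pipeline code never sees the conflicting combination.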

code_bert/cli/utils.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
1+
import sys
2+
3+
4+
def query_yes_no(question, default="yes"):
5+
"""Ask a yes/no question via input() and return their answer.
6+
7+
"question" is a string that is presented to the user.
8+
"default" is the presumed answer if the user just hits <Enter>.
9+
It must be "yes" (the default), "no" or None (meaning
10+
an answer is required of the user).
11+
12+
The "answer" return value is True for "yes" or False for "no".
13+
"""
14+
valid = {"yes": True, "y": True, "ye": True,
15+
"no": False, "n": False}
16+
if default is None:
17+
prompt = " [y/n] "
18+
elif default == "yes":
19+
prompt = " [Y/n] "
20+
elif default == "no":
21+
prompt = " [y/N] "
22+
else:
23+
raise ValueError("invalid default answer: '%s'" % default)
24+
25+
while True:
26+
sys.stdout.write(question + prompt)
27+
choice = input().lower()
28+
if default is not None and choice == '':
29+
return valid[default]
30+
elif choice in valid:
31+
return valid[choice]
32+
else:
33+
sys.stdout.write("Please respond with 'yes' or 'no' "
34+
"(or 'y' or 'n').\n")
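
A prompt like this is easier to exercise without a terminal when the reader function is injectable. The following is a sketch of that idea, not the code in this commit: `ask_yes_no` and its `read` parameter are hypothetical names, and the accepted answers mirror the `valid` table above.

```python
from typing import Callable, Optional

VALID = {"yes": True, "y": True, "ye": True, "no": False, "n": False}


def ask_yes_no(question: str, default: Optional[str] = "yes",
               read: Callable[[str], str] = input) -> bool:
    """Ask a yes/no question; `read` is injectable so tests need no terminal."""
    if default not in ("yes", "no", None):
        raise ValueError(f"invalid default answer: {default!r}")
    prompt = {None: " [y/n] ", "yes": " [Y/n] ", "no": " [y/N] "}[default]
    while True:
        choice = read(question + prompt).strip().lower()
        if choice == "" and default is not None:
            return VALID[default]  # bare <Enter> selects the default
        if choice in VALID:
            return VALID[choice]
        print("Please respond with 'yes' or 'no' (or 'y' or 'n').")


# Simulate a user who first types something invalid, then "N".
answers = iter(["maybe", "N"])
result = ask_yes_no("Continue?", read=lambda _prompt: next(answers))
print(result)
```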

code_bert/core/data_preparation.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
import os
2+
3+
from tree_hugger.core import PythonParser
4+
from .data_reader import process_code
5+
6+
7+
def iter_dir(dir_name):
8+
for root, _ ,f_names in os.walk(dir_name):
9+
for f in f_names:
10+
yield os.path.join(root, f)
11+
12+
13+
class FileParser(object):
14+
15+
def __init__(self, lib_location, query_file_location):
16+
self.pp = PythonParser(lib_location, query_file_location)
17+
18+
def _combine_lines(self, logical_lines):
19+
c = " ".join(logical_lines)
20+
c = c.split()
21+
return " ".join(c[:256])
22+
23+
24+
def parse_file_and_get_data(self, file_path):
25+
if not self.pp.parse_file(file_path):
26+
raise Exception(f"\n\nCould not parse file {file_path}")
27+
func_name_and_doc_str = self.pp.get_all_function_docstrings(strip_quotes=True)
28+
func_name_and_body = self.pp.get_all_function_bodies(strip_docstr=True)
29+
30+
for fname, docstr in func_name_and_doc_str.items():
31+
if func_name_and_body.get(fname):
32+
func_body, _ = func_name_and_body[fname]
33+
logical_lines = process_code(func_body)
34+
combined_lines = self._combine_lines(logical_lines)
35+
yield fname, combined_lines, docstr.split("\n")[0]
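
The two text-shaping steps in `FileParser` (joining logical lines with a 256-token cap, and keeping only the first docstring line) can be tried in isolation. The sketch below restates them as free functions; the names `combine_lines` and `first_docstring_line` are introduced here for illustration, and the sample lines are made up.

```python
from typing import Iterable

MAX_TOKENS = 256  # same cap as _combine_lines in this commit


def combine_lines(logical_lines: Iterable[str], max_tokens: int = MAX_TOKENS) -> str:
    """Join logical lines, normalize whitespace, keep at most max_tokens tokens."""
    tokens = " ".join(logical_lines).split()
    return " ".join(tokens[:max_tokens])


def first_docstring_line(docstr: str) -> str:
    """Keep only the first line of a docstring, as the parser does."""
    return docstr.split("\n")[0]


lines = ["def add ( a , b ) : indent", "return a + b", "dedent"]
combined = combine_lines(lines)
print(combined)
print(first_docstring_line("sums two numbers\nand returns the result"))
```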

code_bert/core/prediction.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
from uuid import uuid4
2+
from transformers import *
3+
import numpy as np
4+
import torch
5+
6+
7+
class Prediction():
8+
9+
def __init__(self, model_path, model_type="QQP"):
10+
self.model_path = f"./{model_path}/{model_type}"
11+
12+
self.tokenzier = AutoTokenizer.from_pretrained(self.model_path)
13+
self.model = RobertaForSequenceClassification.from_pretrained(self.model_path)
14+
15+
self.processor = glue_processors['qqp']()
16+
self.output_mode = glue_output_modes['qqp']
17+
self.label_list = self.processor.get_labels()
18+
19+
def _predict(self, example):
20+
features = glue_convert_examples_to_features(example,
21+
self.tokenzier,
22+
max_length=512,
23+
label_list=self.label_list,
24+
output_mode=self.output_mode)
25+
labels = torch.tensor([1]).unsqueeze(0)
26+
with torch.no_grad():
27+
output = self.model(torch.tensor(features[0].input_ids).unsqueeze(0), labels=labels)
28+
loss = output[0].numpy()
29+
match = np.argmax(output[1].numpy())
30+
return match, loss
31+
32+
def predict(self, func_body, doc_str):
33+
guid = "test_0"
34+
example = [InputExample(guid=guid, text_a=func_body, text_b=doc_str, label=None)]
35+
return self._predict(example)

libs/.DS_Store

6 KB
Binary file not shown.

libs/darwin/my-languages.so

215 KB
Binary file not shown.
