autosoft-dev
diff --git a/.gitignore b/.gitignore
Lines changed: 3 additions & 3 deletions
diff --git a/.python-version.actual b/.python-version.actual
Lines changed: 0 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 95 additions & 14 deletions
diff --git a/code_bert/__init__.py b/code_bert/__init__.py
Lines changed: 1 addition & 1 deletion
diff --git a/code_bert/cli/run_pipeline.py b/code_bert/cli/run_pipeline.py
Lines changed: 72 additions & 0 deletions
diff --git a/code_bert/cli/utils.py b/code_bert/cli/utils.py
Lines changed: 34 additions & 0 deletions
diff --git a/code_bert/core/data_preparation.py b/code_bert/core/data_preparation.py
Lines changed: 35 additions & 0 deletions
diff --git a/code_bert/core/prediction.py b/code_bert/core/prediction.py
Lines changed: 35 additions & 0 deletions
diff --git a/libs/.DS_Store b/libs/.DS_Store
6 KB
diff --git a/libs/darwin/my-languages.so b/libs/darwin/my-languages.so
215 KB
@@ -3,8 +3,6 @@ __pycache__/
 *.py[cod]
 *$py.class
 
-# C extensions
-*.so
 
 # Distribution / packaging
 .Python
@@ -135,4 +133,6 @@ dmypy.json
 
 .data/
 
-model/
+model/
+
+.python-version.th
@@ -1,28 +1,109 @@
 # code-bert
-BERT/RoBERTa kind of language representation model training and fine tuning
 
-We have released the model [here](https://huggingface.co/codistai/codeBERT-small-v2)
+This is [CodistAI](https://codist-ai.com/)'s open source package for easily using the fine-tuned model based on our open source MLM code model [codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2)
 
-However, this small python module serves as the pre-tokenization step needed for the tokenizer to deal with code.
+[codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2) is a RoBERTa model trained with the Hugging Face Transformers library. We then fine-tuned it on the following prediction task:
 
+Given a function body `f` as a string of code tokens (including special tokens such as `indent` and `dedent`) and a doc string `d` as a string of natural language tokens, predict whether `f` and `d` are associated (that is, whether they represent the same concept).
+
+## An example
+
+Let's consider the following code
+
+```python
+from pathlib import Path
+
+def get_file(filename):
+    """
+    opens a url
+    """
+    if not Path(filename).is_file():
+        return None
+    return open(filename, "rb")
 
 ```
-from code_bert.core.data_reader import process_code
 
-with open("test_files/test_code_get.py") as f:
-    code = f.read()
+Using [tree-hugger](https://github.com/autosoft-dev/tree-hugger), another of our open source libraries, it is straightforward to extract the function body and the docstring from this code with a single API call.
+
+We can then use the [`process_code`](https://github.com/autosoft-dev/code-bert/blob/2dd35f16fa2cdb96f75e21bb0a9393aa3164d885/code_bert/core/data_reader.py#L136) method from this repo to convert the code lines into the format that [codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2) expects.
 
+Doing the above two steps properly would produce something like the following
 
-process_code(code)
+- **Function** - `def get file ( filename ) : indent if not path ( filename ) . is file ( ) : indent return none dedent return open ( filename , "rb" ) dedent`
+
+- **Doc String** - `opens a url`
+
+Ideally, we then need a model that can run the following pseudocode:
+
+```python
+match, confidence = model(function, docstring)
 ```
 
-This will produce a result like this 
+## code-bert CLI
+
+**The entire code base is built for and available on Python 3.6+**
+
+We provide easy-to-use CLI commands that do all of this, and at scale. Let's go through them step by step.
+
+**We strongly recommend using a virtual environment for the following steps**
+
+1. First clone this repo - `git clone https://github.com/autosoft-dev/code-bert.git`
+
+2. With the virtual environment activated, run `pip install -r requirements.txt`
+
+3. Then install the package with `pip install -e .`
+
+4. Next, download and set up the model. If the steps above succeeded, the `download_model` command is available for this.
+
+5. The model is almost 1.7 GB in total, so the download may take a while.
+
+6. Once this is done, you are ready to analyze code using the CLI described in the following section.
+
+-----------
+
+Assuming the model is downloaded and ready, you can run the following command to analyze a single file or a directory of files:
+
+```
+usage: run_pipeline [-h] [-f FILE_NAME] [-r RECURSIVE]
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -f FILE_NAME, --file_name FILE_NAME
+                        The name of the file you want to run the pipeline on
+  -r RECURSIVE, --recursive RECURSIVE
+                        Put the directory if you want to run recursively
+```
+
+Say you have a directory called `test_files` with some Python files in it. This is how you can analyze them:
+
+`run_pipeline -r test_files`
+
+A prompt will ask you to confirm the model location. Once you confirm, the pipeline walks the whole directory recursively and analyzes one file at a time.
+
+It should produce a report like the following:
+
 
 ```
-Out[4]:
-['from pathlib import path',
- 'def get file ( filename ) : indent',
- 'if not path ( filename ) . is file ( ) : indent',
- 'return none',
- 'dedent return open ( filename , "rb" ) dedent']
+ ======== Analysing test_files/test_code_add.py =========
+
+
+def add ( a , b ) : indent return a + b dedent
+Function "add" with Docstring """sums two numbers and returns the result"""
+Do they match?
+Yes
+******************************************************************
+def return all even ( lst ) : indent if not lst : indent return none dedent return [ a for a in lst if a % 2 == 0 ] dedent
+Function "return_all_even" with Docstring """numbers that are not really odd"""
+Do they match?
+Yes
+******************************************************************
+
+ ======== Analysing test_files/inner_dir/test_code_get.py =========
+
+
+def get file ( filename ) : indent if not path ( filename ) . is file ( ) : indent return none dedent return open ( filename , "rb" ) dedent
+Function "get_file" with Docstring """opens a url"""
+Do they match?
+No
+******************************************************************
 ```
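The flattened function strings in the report above can be approximated with the standard library alone. The following is a minimal sketch, not the repository's `process_code`: the helper name `flatten_code` is hypothetical, and it does not reproduce the lowercasing or the identifier splitting (`get_file` becoming `get file`) visible in the README output.

```python
import io
import tokenize

# Token types that carry no content in the flattened form.
_SKIPPED = {tokenize.NEWLINE, tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER}


def flatten_code(source):
    """Return `source` as one space-separated token string.

    Hypothetical helper for illustration only. Block structure is kept by
    emitting the literal words "indent" and "dedent".
    """
    pieces = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.INDENT:
            pieces.append("indent")
        elif tok.type == tokenize.DEDENT:
            pieces.append("dedent")
        else:
            pieces.append(tok.string)
    return " ".join(pieces)


# Prints: def add ( a , b ) : indent return a + b dedent
print(flatten_code("def add(a, b):\n    return a + b\n"))
```

Relying on the tokenizer rather than on string splitting means indentation changes are detected by Python's own rules, which is the property the `indent` and `dedent` markers are meant to capture.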
@@ -1 +1 @@
-__version__ = "0.1.5"
+__version__ = "0.2.0"
@@ -0,0 +1,72 @@
+import logging
+import platform
+from pathlib import Path
+from argparse import ArgumentParser
+
+from .utils import query_yes_no
+from ..core.data_preparation import iter_dir, FileParser
+from ..core.prediction import Prediction
+
+LIBS = {"Linux": "libs/linux", "Darwin": "libs/darwin"}
+
+logging.disable(logging.INFO)
+
+is_python_file = lambda x: Path(x).suffix == ".py"
+
+
+def _my_os():
+    return platform.system()
+
+
+def _run_model(file_path, file_parser, predictor):
+    print(f"\n ======== Analysing {file_path} =========\n\n")
+    for func_name, func_body, docstr in file_parser.parse_file_and_get_data(file_path):
+        match, _ = predictor.predict(func_body, docstr)
+        match_yes = "Yes" if match else "No"
+        print(func_body)
+        print(f'Function "{func_name}" with Docstring """{docstr}"""\nDo they match?\n{match_yes}')
+        print("******************************************************************")
+
+
+def run_pipeline(args):
+    if args.file_name and args.recursive:
+        raise Exception("\n\nCannot specify both a single file and a directory.\n Use either -f or -r"
+        )
+    
+    if not Path("Model").exists() or not Path("Model").is_dir():
+        raise Exception("\n\nEither the Model directory does not exist or it is invalid")
+
+    choice = query_yes_no("We believe that the model is at 'Model' directory. Shall we continue?")
+
+    if choice:
+        print("Loading model")
+        predictor = Prediction("Model")
+        print("Model loaded")
+        os_version = _my_os()
+        lib = LIBS.get(os_version)
+        if not lib:
+            raise Exception(f"\n\nYour version of OS {os_version} is not supported yet!")
+        lib = f"{lib}/my-languages.so"
+        query_file = "queries/queries.yml"
+
+        fp = FileParser(lib, query_file)
+
+        if args.recursive:
+            for file_path in iter_dir(args.recursive):
+                if is_python_file(file_path):
+                    _run_model(file_path, fp, predictor)     
+        elif args.file_name:
+            if is_python_file(args.file_name):
+                _run_model(args.file_name, fp, predictor)
+    else:
+        print("Bye Bye!")
+
+
+
+def main():
+    parser = ArgumentParser()
+    parser.add_argument("-f", "--file_name", type=str, required=False, help="The name of the file you want to run the pipeline on")
+    parser.add_argument("-r", "--recursive", required=False, help="Put the directory if you want to run recursively")
+
+    args = parser.parse_args()
+    run_pipeline(args)
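The check in `run_pipeline` that rejects `-f` together with `-r` can also be delegated to `argparse`. This is a sketch under assumptions, not the code in this commit: `build_parser` is a hypothetical helper, the flags and help strings are copied from `main()`, and marking the group as required is an added design choice.

```python
from argparse import ArgumentParser


def build_parser():
    """Hypothetical variant of the parser built in `main()`."""
    parser = ArgumentParser(prog="run_pipeline")
    # argparse exits with a usage error unless exactly one of -f / -r is given.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("-f", "--file_name", type=str,
                        help="The name of the file you want to run the pipeline on")
    target.add_argument("-r", "--recursive",
                        help="Put the directory if you want to run recursively")
    return parser


args = build_parser().parse_args(["-r", "test_files"])
print(args.file_name, args.recursive)
```

With this arrangement the user sees a standard usage message before the model is loaded, and the pipeline body only has to branch on which of the two attributes is set.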
@@ -0,0 +1,34 @@
+import sys
+
+
+def query_yes_no(question, default="yes"):
+    """Ask a yes/no question via input() and return their answer.
+
+    "question" is a string that is presented to the user.
+    "default" is the presumed answer if the user just hits <Enter>.
+        It must be "yes" (the default), "no" or None (meaning
+        an answer is required of the user).
+
+    The "answer" return value is True for "yes" or False for "no".
+    """
+    valid = {"yes": True, "y": True, "ye": True,
+             "no": False, "n": False}
+    if default is None:
+        prompt = " [y/n] "
+    elif default == "yes":
+        prompt = " [Y/n] "
+    elif default == "no":
+        prompt = " [y/N] "
+    else:
+        raise ValueError("invalid default answer: '%s'" % default)
+
+    while True:
+        sys.stdout.write(question + prompt)
+        choice = input().lower()
+        if default is not None and choice == '':
+            return valid[default]
+        elif choice in valid:
+            return valid[choice]
+        else:
+            sys.stdout.write("Please respond with 'yes' or 'no' "
+                             "(or 'y' or 'n').\n")
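Because `query_yes_no` reads from the keyboard, it is awkward to exercise in automated tests. One possible approach, sketched here with a hypothetical `ask_yes_no` function and a hypothetical `read` parameter, is to inject the reader so scripted answers can replace user input. The accepted answers mirror the `valid` table in the diff.

```python
VALID = {"yes": True, "y": True, "ye": True, "no": False, "n": False}
PROMPTS = {None: " [y/n] ", "yes": " [Y/n] ", "no": " [y/N] "}


def ask_yes_no(question, default="yes", read=input):
    """Compact, hypothetical variant of `query_yes_no` with an injectable reader."""
    if default not in PROMPTS:
        raise ValueError(f"invalid default answer: {default!r}")
    while True:
        choice = read(question + PROMPTS[default]).strip().lower()
        if choice == "" and default is not None:
            return VALID[default]
        if choice in VALID:
            return VALID[choice]
        print("Please respond with 'yes' or 'no' (or 'y' or 'n').")


# Scripted answers stand in for the keyboard: one invalid reply, then "n".
answers = iter(["maybe", "n"])
print(ask_yes_no("Continue?", read=lambda prompt: next(answers)))
```

Leaving `read` at its default keeps the interactive behaviour, so the CLI call site would not need to change.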
@@ -0,0 +1,35 @@
+import os
+
+from tree_hugger.core import PythonParser
+from .data_reader import process_code
+
+
+def iter_dir(dir_name):
+    for root, _ ,f_names in os.walk(dir_name):
+        for f in f_names:
+            yield os.path.join(root, f)
+
+
+class FileParser(object):
+
+    def __init__(self, lib_location, query_file_location):
+        self.pp = PythonParser(lib_location, query_file_location)
+    
+    def _combine_lines(self, logical_lines):
+        c = " ".join(logical_lines)
+        c = c.split()
+        return " ".join(c) if len(c) < 256 else " ".join(c[:256])
+
+    
+    def parse_file_and_get_data(self, file_path):
+        if not self.pp.parse_file(file_path):
+            raise Exception(f"\n\nCould not parse file {file_path}")
+        func_name_and_doc_str = self.pp.get_all_function_docstrings(strip_quotes=True)
+        func_name_and_body = self.pp.get_all_function_bodies(strip_docstr=True)
+        
+        for fname, docstr in func_name_and_doc_str.items():
+            if func_name_and_body.get(fname):
+                func_body, _ = func_name_and_body[fname]
+                logical_lines = process_code(func_body)
+                combined_lines = self._combine_lines(logical_lines)
+                yield fname, combined_lines, docstr.split("\n")[0]
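The two helpers in this file, directory walking and token truncation, can be tried without tree-hugger or the model. The sketch below is standalone and makes some assumptions: `iter_python_files` and `combine_lines` are hypothetical names, the `.py` filter is folded into the walk instead of living in the CLI, and the 256-token cap is taken from `_combine_lines`.

```python
import os
import tempfile
from pathlib import Path

MAX_TOKENS = 256  # same cap as `_combine_lines` in the diff


def combine_lines(logical_lines, max_tokens=MAX_TOKENS):
    """Join lines, collapse whitespace, keep at most `max_tokens` tokens."""
    tokens = " ".join(logical_lines).split()
    # Slicing already handles short inputs, so no length check is needed.
    return " ".join(tokens[:max_tokens])


def iter_python_files(dir_name):
    """Yield every `.py` file below `dir_name` (hypothetical helper)."""
    for root, _, f_names in os.walk(dir_name):
        for name in sorted(f_names):
            if Path(name).suffix == ".py":
                yield os.path.join(root, name)


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "inner").mkdir()
    Path(tmp, "inner", "a.py").write_text("x = 1\n")
    Path(tmp, "notes.txt").write_text("skip me\n")
    found = [os.path.relpath(p, tmp) for p in iter_python_files(tmp)]

print(found)
print(len(combine_lines(["tok"] * 300).split()))
```

Note that the cap counts whitespace-separated tokens, not the subword tokens the model's tokenizer produces, so a 256-token string can still expand to a longer model input.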
@@ -0,0 +1,35 @@
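+from transformers import AutoTokenizer, InputExample, RobertaForSequenceClassification
+from transformers import glue_convert_examples_to_features, glue_output_modes, glue_processors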
+import numpy as np
+import torch
+
+
+class Prediction():
+
+    def __init__(self, model_path, model_type="QQP"):
+        self.model_path = f"./{model_path}/{model_type}"
+        
+        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
+        self.model = RobertaForSequenceClassification.from_pretrained(self.model_path)
+
+        self.processor = glue_processors['qqp']()
+        self.output_mode = glue_output_modes['qqp']
+        self.label_list = self.processor.get_labels()
+
+    def _predict(self, example):
+        features = glue_convert_examples_to_features(example,
+                                                     self.tokenizer,
+                                                     max_length=512,
+                                                     label_list=self.label_list,
+                                                     output_mode=self.output_mode)
+        labels = torch.tensor([1]).unsqueeze(0)
+        with torch.no_grad():
+            output = self.model(torch.tensor(features[0].input_ids).unsqueeze(0), labels=labels)
+            loss = output[0].numpy()
+            match = np.argmax(output[1].numpy())
+            return match, loss
+
+    def predict(self, func_body, doc_str):
+        guid = "test_0"
+        example = [InputExample(guid=guid, text_a=func_body, text_b=doc_str, label=None)]
+        return self._predict(example)
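The README pseudocode asks for `match, confidence`, while `_predict` returns the predicted class together with the loss against a fixed label of 1. If a probability is wanted instead, one option is a softmax over the classifier logits. The sketch below assumes that `output[1]` holds the raw logits for the two classes, as the `np.argmax` call suggests. `match_and_confidence` is a hypothetical helper and the logits are made up.

```python
import math


def match_and_confidence(logits):
    """Return (argmax index, softmax probability of that index)."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    match = max(range(len(probs)), key=probs.__getitem__)
    return match, probs[match]


match, confidence = match_and_confidence([-1.2, 2.3])  # made-up logits
print(match, round(confidence, 3))
```

Inside `_predict` the same numbers could be obtained with `torch.softmax` on the logits tensor. The pure Python version is used here only so the example runs without the model.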