|
1 | 1 | # code-bert
|
2 |
| -BERT/RoBERTa kind of language representation model training and fine tuning |
3 | 2 |
|
4 |
| -We have released the model [here](https://huggingface.co/codistai/codeBERT-small-v2) |
| 3 | +This is [CodistAI](https://codist-ai.com/)'s open source package for easily using the model fine-tuned from our open source MLM code model [codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2) |
5 | 4 |
|
6 |
| -However, this small python module serves as the pre-tokenization step needed for the tokenizer to deal with code. |
| 5 | +[codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2) is a RoBERTa model trained with the Hugging Face Transformers library. We then fine-tuned it on the following prediction task: |
7 | 6 |
|
| 7 | +Given a function body `f` as a string of code tokens (including special tokens such as `indent` and `dedent`) and a docstring `d` as a string of natural language tokens, predict whether `f` and `d` are associated (that is, whether they represent the same concept). |
| 8 | + |
| 9 | +## An example |
| 10 | + |
| 11 | +Let's consider the following code |
| 12 | + |
| 13 | +```python |
| 14 | +from pathlib import Path |
| 15 | + |
| 16 | +def get_file(filename): |
| 17 | +    """ |
| 18 | +    opens a url |
| 19 | +    """ |
| 20 | +    if not Path(filename).is_file(): |
| 21 | +        return None |
| 22 | +    return open(filename, "rb") |
8 | 23 |
|
9 | 24 | ```
|
10 |
| -from code_bert.core.data_reader import process_code |
11 | 25 |
|
12 |
| -with open("test_files/test_code_get.py") as f: |
13 |
| - code = f.read() |
| 26 | +Using our other open source library, [tree-hugger](https://github.com/autosoft-dev/tree-hugger), it is straightforward to extract the code and separate the function body from the docstring with a single API call. |
| 27 | + |
| 28 | +We can then use the [`process_code`](https://github.com/autosoft-dev/code-bert/blob/2dd35f16fa2cdb96f75e21bb0a9393aa3164d885/code_bert/core/data_reader.py#L136) function from this repo to convert the code lines into the format that [codeBERT-small-v2](https://huggingface.co/codistai/codeBERT-small-v2) expects. |
14 | 29 |
|
| 30 | +Performing these two steps produces something like the following: |
15 | 31 |
|
16 |
| -process_code(code) |
| 32 | +- **Function** - `def get file ( filename ) : indent if not path ( filename ) . is file ( ) : indent return none dedent return open ( filename , "rb" ) dedent` |
| 33 | + |
| 34 | +- **Doc String** - `opens a url` |
| 35 | + |
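The flattened function string above can be approximated with Python's standard `tokenize` module. This is a minimal sketch for illustration, not the repository's `process_code` implementation: the helper name `flatten_code` is hypothetical, and the rules it applies (lowercasing names, splitting on underscores, emitting `indent`/`dedent` markers) are inferred from the example output rather than taken from the library.

```python
import io
import tokenize

# Token types that carry no content in the flattened form.
SKIPPED = {tokenize.NEWLINE, tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER}


def flatten_code(source: str) -> str:
    """Flatten Python source into a space-separated token string (hypothetical helper)."""
    pieces = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            pieces.append("indent")
        elif tok.type == tokenize.DEDENT:
            pieces.append("dedent")
        elif tok.type == tokenize.NAME:
            # Assumption: identifiers and keywords are lowercased and
            # snake_case names are split into separate words.
            pieces.extend(tok.string.lower().split("_"))
        elif tok.type not in SKIPPED:
            pieces.append(tok.string)
    # Drop empty fragments produced by leading or doubled underscores.
    return " ".join(piece for piece in pieces if piece)


# The function body from the example, with the docstring already removed.
SOURCE = (
    "def get_file(filename):\n"
    "    if not Path(filename).is_file():\n"
    "        return None\n"
    '    return open(filename, "rb")\n'
)

print(flatten_code(SOURCE))
```

The sketch expects the docstring to have been separated out beforehand, since string literals are passed through unchanged.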
| 36 | +Ideally, we then need a model that can run the following pseudocode: |
| 37 | + |
| 38 | +```python |
| 39 | +match, confidence = model(function, docstring) |
17 | 40 | ```
|
18 | 41 |
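One way to read the pseudocode is as a binary classifier whose two raw scores are converted into a decision and a confidence. The sketch below shows only that last step, under stated assumptions: the source does not describe the model's output format, so the two-logit layout, the `decide` helper, and the choice of index 1 as the "match" class are all hypothetical.

```python
import math


def decide(logits, match_index=1):
    """Turn raw classifier scores into (match, confidence).

    Assumption: `logits` holds one score per class and `match_index`
    marks the class meaning "function and docstring are associated".
    """
    # Numerically stable softmax: subtract the largest score first.
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # The predicted class is the most probable one; its probability
    # serves as the confidence.
    best = max(range(len(probs)), key=probs.__getitem__)
    return best == match_index, probs[best]


# Made-up scores, used only to show the shape of the result.
match, confidence = decide([-1.2, 2.3])
print(match, round(confidence, 3))
```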
|
19 |
| -This will produce a result like this |
| 42 | +## code-bert CLI |
| 43 | + |
| 44 | +**The entire code base is built for and available on Python 3.6+** |
| 45 | + |
| 46 | +We provide easy-to-use CLI commands to do all of this at scale. Let's go through them step by step. |
| 47 | + |
| 48 | +**We strongly recommend using a virtual environment for the following steps** |
| 49 | + |
| 50 | +1. First clone this repo - `git clone https://github.com/autosoft-dev/code-bert.git` |
| 51 | + |
| 52 | +2. (Assuming you have the virtualenv activated) Then do `pip install -r requirements.txt` |
| 53 | + |
| 54 | +3. Then install the package with `pip install -e .` |
| 55 | + |
| 56 | +4. Next, download and set up the model. If the steps above completed successfully, the `download_model` command is available for this. |
| 57 | + |
| 58 | +5. The model is almost 1.7 GB in total, so the download may take some time to finish. |
| 59 | + |
| 60 | +6. Once this is done, you are ready to analyze code. A CLI command is provided for that as well; see the following section for details. |
| 61 | + |
| 62 | +----------- |
| 63 | + |
| 64 | +Assuming the model is downloaded and ready, run the following command to analyze a single file or a directory of files: |
| 65 | + |
| 66 | +``` |
| 67 | +usage: run_pipeline [-h] [-f FILE_NAME] [-r RECURSIVE] |
| 68 | +
|
| 69 | +optional arguments: |
| 70 | + -h, --help show this help message and exit |
| 71 | + -f FILE_NAME, --file_name FILE_NAME |
| 72 | + The name of the file you want to run the pipeline on |
| 73 | + -r RECURSIVE, --recursive RECURSIVE |
| 74 | + Put the directory if you want to run recursively |
| 75 | +``` |
| 76 | + |
| 77 | +For example, say you have a directory called `test_files` with some Python files in it. This is how you can analyze them: |
| 78 | + |
| 79 | +`run_pipeline -r test_files` |
| 80 | + |
| 81 | +A prompt will appear asking you to confirm the model location. Once you confirm, the pipeline analyzes one file at a time, recursing through the whole directory. |
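The recursive walk described here can be pictured with `pathlib`. This is only a sketch of the idea: the source does not show how `run_pipeline` collects files internally, and `iter_python_files` is a hypothetical helper name.

```python
from pathlib import Path


def iter_python_files(root):
    """Yield every .py file under `root`, in a stable sorted order."""
    root_path = Path(root)
    if root_path.is_file():
        # A single file was given: nothing to recurse into.
        yield root_path
        return
    # rglob descends into all subdirectories, such as test_files/inner_dir.
    yield from sorted(root_path.rglob("*.py"))
```

Calling `list(iter_python_files("test_files"))` on a layout like the one in this example would include both the top-level files and those under `inner_dir`.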
| 82 | + |
| 83 | +The command should produce a report like the following: |
| 84 | + |
20 | 85 |
|
21 | 86 | ```
|
22 |
| -Out[4]: |
23 |
| -['from pathlib import path', |
24 |
| - 'def get file ( filename ) : indent', |
25 |
| - 'if not path ( filename ) . is file ( ) : indent', |
26 |
| - 'return none', |
27 |
| - 'dedent return open ( filename , "rb" ) dedent'] |
| 87 | + ======== Analysing test_files/test_code_add.py ========= |
| 88 | +
|
| 89 | +
|
| 90 | +def add ( a , b ) : indent return a + b dedent |
| 91 | +Function "add" with Dcostring """sums two numbers and returns the result""" |
| 92 | +Do they match? |
| 93 | +Yes |
| 94 | +****************************************************************** |
| 95 | +def return all even ( lst ) : indent if not lst : indent return none dedent return [ a for a in lst if a % 2 == 0 ] dedent |
| 96 | +Function "return_all_even" with Dcostring """numbers that are not really odd""" |
| 97 | +Do they match? |
| 98 | +Yes |
| 99 | +****************************************************************** |
| 100 | +
|
| 101 | + ======== Analysing test_files/inner_dir/test_code_get.py ========= |
| 102 | +
|
| 103 | +
|
| 104 | +def get file ( filename ) : indent if not path ( filename ) . is file ( ) : indent return none dedent return open ( filename , "rb" ) dedent |
| 105 | +Function "get_file" with Dcostring """opens a url""" |
| 106 | +Do they match? |
| 107 | +No |
| 108 | +****************************************************************** |
28 | 109 | ```
|
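If the report needs to be consumed by another script, its plain-text layout can be parsed line by line. The sketch below is based solely on the sample report shown above; `parse_report` is a hypothetical helper, and the real output format may differ in ways the sample does not reveal.

```python
import re

# Patterns inferred from the sample report only.
HEADER = re.compile(r'=+ Analysing (?P<path>.+?) =+')
FUNC = re.compile(r'Function "(?P<name>[^"]+)" with \w+ """(?P<doc>.*)"""')


def parse_report(text):
    """Collect one record per analysed function (hypothetical helper)."""
    results = []
    current_file, pending, expect_answer = None, None, False
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.fullmatch(line)
        if header:
            current_file = header.group("path")
            continue
        func = FUNC.fullmatch(line)
        if func:
            pending = (func.group("name"), func.group("doc"))
            continue
        if line == "Do they match?":
            expect_answer = True
            continue
        if expect_answer and pending and line in ("Yes", "No"):
            results.append({
                "file": current_file,
                "function": pending[0],
                "docstring": pending[1],
                "match": line == "Yes",
            })
            pending, expect_answer = None, False
    return results


SAMPLE = '''
 ======== Analysing test_files/inner_dir/test_code_get.py =========

def get file ( filename ) : indent return open ( filename , "rb" ) dedent
Function "get_file" with Dcostring """opens a url"""
Do they match?
No
******************************************************************
'''

for record in parse_report(SAMPLE):
    print(record)
```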