Merge pull request #16 from SasCezar/dev
Updated dependencies and fixed interface changes in tree-sitter
SasCezar authored Oct 6, 2024
2 parents 9c8b538 + ac109c6 commit 6782ecf
Showing 21 changed files with 229 additions and 265 deletions.
181 changes: 74 additions & 107 deletions README.md
@@ -1,88 +1,77 @@

# AutoFL

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![DOI](https://zenodo.org/badge/644095707.svg)](https://zenodo.org/doi/10.5281/zenodo.10255367)
[![Docker](https://img.shields.io/badge/Docker-blue.svg)](https://img.shields.io/badge/Docker-blue)
[![Docker](https://img.shields.io/badge/Docker-blue.svg)](https://hub.docker.com/r/cezarsas/autofl/)

Automatic source code file annotation using weak labeling.

Automatic source code file annotation using weak labelling.
## Overview

AutoFL is a tool designed for automatic annotation of source code files through weak labeling techniques. It provides both an API and a web-based UI for easy analysis of projects across different languages.

## Setup

Clone the repository and the UI submodule [autofl-ui](https://github.com/SasCezar/autofl-ui) by running the following
command:
To set up the repository along with its UI submodule, clone it using:

```bash
git clone --recursive git@github.com:SasCezar/AutoFL.git AutoFL
```

### Optional Setup
### Optional Model Setup

To make use of certain feature like semantic based labelling functions, you need to download the model.
For example, for **w2v-so**, you can download the model from [here](https://github.com/vefstathiou/SO_word2vec), and
place it in the [data/models/w2v-so](data/models/w2v-so) folder, or a custom
path that you can use in the configs.
For advanced features like semantic-based labeling, download models as required. For example, to use **w2v-so**, download the model from [here](https://github.com/vefstathiou/SO_word2vec) and place it in the `data/models/w2v-so` folder. Alternatively, you can provide a custom path in the configuration files.

## Usage

Run docker compose in the project folder (where the [docker-compose.yaml](docker-compose.yaml) is located) by executing:
To run the tool using Docker, navigate to the project directory (where the `docker-compose.yaml` file is located) and execute:

```shell
docker compose up
```

### API Endpoint

You can analyze the files of project by making a request to the endpoint:
To analyze the files of a project, make a POST request to the following endpoint:

```shell
curl -X POST -d '{"name": "<PROJECT_NAME>", "remote": "<PROJECT_REMOTE>", "languages": ["<PROGRAMMING_LANGUAGE>"]}' localhost:8000/label/files -H "content-type: application/json"
curl -X POST -d '{"name": "<PROJECT_NAME>", "remote": "<PROJECT_REMOTE>", "languages": ["<PROGRAMMING_LANGUAGE>"]}' localhost:8000/label/files -H "content-type: application/json"
```

For example, to analyze the files
of [https://github.com/mickleness/pumpernickel](https://github.com/mickleness/pumpernickel), you can make the following
request:
For instance, to analyze the project at [https://github.com/mickleness/pumpernickel](https://github.com/mickleness/pumpernickel), use:

```shell
curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json"
curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json"
```
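
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the endpoint and payload fields are taken from the curl examples above, while the helper names `build_label_request` and `label_files` are hypothetical and not part of AutoFL. The response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request

# Endpoint from the curl examples; assumes the Docker stack is running locally.
API_URL = "http://localhost:8000/label/files"


def build_label_request(name: str, remote: str, languages: list) -> urllib.request.Request:
    """Build the POST request equivalent to the curl command (hypothetical helper)."""
    payload = {"name": name, "remote": remote, "languages": languages}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def label_files(name: str, remote: str, languages: list) -> str:
    """Send the request and return the raw response body as text."""
    request = build_label_request(name, remote, languages)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the service running, `label_files("pumpernickel", "https://github.com/mickleness/pumpernickel", ["java"])` would issue the same call as the curl example.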

### UI
### Web UI

The tool also offers a web UI that is available at the following page (when running locally):
[http://localhost:8501](http://localhost:8501)
AutoFL provides a web-based UI accessible locally at [http://localhost:8501](http://localhost:8501):

![UI](resources/ui-screenshots/landing-page.png)

For more details, check the [UI repo](https://github.com/SasCezar/autofl-ui).

[//]: # (For more details, check the [UI repo]&#40;https://github.com/SasCezar/autofl-ui&#41;)
For more details, check the [UI repository](https://github.com/SasCezar/autofl-ui).

## Configuration

AutoFL uses [Hydra](https://hydra.cc/) to manage the configuration. The configuration files are located in
the [config](config) folder.
The main configuration file is [main.yaml](./config/main.yaml), which contains the following options:
AutoFL uses [Hydra](https://hydra.cc/) to manage configurations. The configuration files can be found in the `config` folder. The main configuration file, `main.yaml`, allows you to customize various options:

- **local**: which environment to use, either local or docker. [Docker](./config/local/docker.yaml) is default.
- **taxonomy**: which taxonomy to use. Currently only [gitranking](./config/taxonomy/gitranking.yaml) is supported, but
custom taxonomies can be added.
- **annotator**: which annotators to use. Default is [simple](./config/annotator/simple.yaml), which allows good results
without extra dependencies on language models.
- **version_strategy**: which version strategy to use. Default is [latest](./config/version_strategy/latest.yaml), which
will only analyze the latest version of the project.
- **dataloader**: which dataloader to use. Default is [postgres](./config/dataloader/postgres.yaml) which allows the API
to fetch already analysed projects.
- **writer**: which writer to use. Default is [postgres](./config/writer/postgres.yaml) which allows the API to store
the results in a database.
- **environment**: Choose between local or Docker environments. [Docker](config/environment/docker.yaml) is the default.
- **taxonomy**: Set the taxonomy for labeling. Currently supports [gitranking](./config/taxonomy/gitranking.yaml). You can add custom taxonomies.
- **annotator**: Specify the annotators to use. The default is [simple](./config/annotator/simple.yaml), offering good results without dependencies on language models.
- **version_strategy**: Select the versioning strategy. The default is [latest](./config/version_strategy/latest.yaml).
- **dataloader**: Choose the dataloader. The default is [postgres](./config/dataloader/postgres.yaml).
- **writer**: Set the writer for storing results. The default is [postgres](./config/writer/postgres.yaml).

Other configuration can be defined by creating a new file in the folder of the specific component.
Additional configurations can be added by creating new files in the corresponding component folders.
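
To illustrate how the options map to files, the sketch below mimics Hydra's config-group convention, where a defaults entry `group: option` selects `config/<group>/<option>.yaml` and a command-line override `group=option` swaps the selection. It is a simplified stand-in, not Hydra itself: the `DEFAULTS` mapping mirrors the entries visible in `config/main.yaml` in this commit, and the function names are hypothetical.

```python
from pathlib import Path

# Entries mirroring the defaults list visible in config/main.yaml (group -> option).
DEFAULTS = {
    "environment": "docker",
    "taxonomy": "gitranking",
    "annotator": "default",
    "version_strategy": "latest",
}


def apply_overrides(defaults: dict, overrides: list) -> dict:
    """Apply command-line style `group=option` overrides to a copy of the defaults."""
    updated = dict(defaults)
    for override in overrides:
        group, _, option = override.partition("=")
        if group not in updated:
            raise KeyError(f"unknown config group: {group}")
        updated[group] = option
    return updated


def resolve_config_files(defaults: dict, config_dir: str = "config") -> list:
    """Map each `group: option` entry to the YAML file that would be selected."""
    return [Path(config_dir) / group / f"{option}.yaml" for group, option in defaults.items()]
```

For example, a custom taxonomy saved as `config/taxonomy/custom.yaml` (a hypothetical file name) would be selected with the override `taxonomy=custom`.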

## Functionalities

- Annotation (UI/API/Script)
- File
- Package
- Project
- File-Level
- Package-Level
- Project-Level
- Batch Analysis (Script Only)
- Temporal Analysis (**TODO**)
- Classification (**TODO**)
@@ -97,26 +86,23 @@ Other configuration can be defined by creating a new file in the folder of the s

## Development

The tool is composed of multiple components, their interaction is shown in the following diagram:
AutoFL is composed of multiple components, as shown in the architecture diagram below:

![Architecture](resources/architecture/architecture.png)

### Add New Languages
### Adding Support for New Languages

In order to support more languages, a new language specific parser is needed.
We can create one quickly by using [tree-sitter](https://tree-sitter.github.io/tree-sitter/),
and a custom parser.
To add support for additional languages, a language-specific parser is required. You can use [tree-sitter](https://tree-sitter.github.io/tree-sitter/) to develop a parser quickly.

#### Parser
#### Parser Details

The parser needs to be in the [parser/languages](./src/parser/languages) folder.
It has to extend the `ParserBase` class, which has the following interface.
The parser needs to be located in the `src/parser/languages` folder. It should extend the `ParserBase` class, which follows this structure:

```python
class ParserBase(ABC):
"""
Abstract class for a programming language parser.
"""
"""
Abstract class for a programming language parser.
"""

def __init__(self, library_path: Path | str):
"""
@@ -126,92 +112,73 @@ class ParserBase(ABC):
...
```

And the language specific class has to contain the logic to parse the language to get the identifiers.
For example for Python, the class will look like this:
To implement the parsing logic, create a class that handles extracting identifiers. For Python, the parser might look like:

```python
class PythonParser(ParserBase,
lang=Extension.python.name): # The lang argument is used to register the parser in the ParserFactory class.
class PythonParser(ParserBase, lang=Extension.python.name):
"""
Python specific parser. Uses a generic grammar for multiple versions of python. Uses tree_sitter to get the AST
Python-specific parser using a generic grammar for multiple versions. Utilizes tree-sitter for AST extraction.
"""

def __init__(self, library_path: Path | str):
super().__init__(library_path)
self.language: Language = Language(library_path,
Extension.python.name) # Creates the tree-sitter language for python
self.parser.set_language(self.language) # Sets tree-sitter parser to parse the language

# Pattern used to match the identifiers; it depends on the language. Check tree-sitter
self.identifiers_pattern: str = """
((identifier) @identifier)
"""

# Creates the query used to find the identifiers in the AST produced by tree-sitter
self.identifiers_query = self.language.query(self.identifiers_pattern)

# Keyword that will be ignored, in this case, the language specific keywords as the query extracts them as well.
self.keywords = set(keyword.kwlist) # Use python's built in keyword list
self.keywords.update(['self', 'cls'])
...
```
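
The `lang=` class keyword above registers the parser in the `ParserFactory`. One common way to implement this kind of registration in Python is `__init_subclass__`. The sketch below is a self-contained illustration of that pattern under this assumption; the classes are simplified stand-ins, the method name `get_identifiers` is hypothetical, and AutoFL's actual implementation may differ.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class ParserFactory:
    """Simplified stand-in: maps a language name to its parser class."""

    _parsers: dict[str, type] = {}

    @classmethod
    def register(cls, lang: str, parser_cls: type) -> None:
        cls._parsers[lang] = parser_cls

    @classmethod
    def create(cls, lang: str, *args, **kwargs) -> ParserBase:
        return cls._parsers[lang](*args, **kwargs)


class ParserBase(ABC):
    """Simplified stand-in for the abstract parser."""

    def __init_subclass__(cls, lang: str | None = None, **kwargs):
        # Runs once per subclass definition; `lang` comes from the class statement.
        super().__init_subclass__(**kwargs)
        if lang is not None:
            ParserFactory.register(lang, cls)

    @abstractmethod
    def get_identifiers(self, code: str) -> list[str]:
        """Return the identifiers found in `code` (hypothetical method name)."""


class EchoParser(ParserBase, lang="echo"):
    """Toy parser for a made-up language: every whitespace-separated word is an identifier."""

    def get_identifiers(self, code: str) -> list[str]:
        return code.split()
```

With this pattern, defining the subclass is enough to make it available through `ParserFactory.create("echo")`; no separate registration call is needed.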

A custom class that does not rely on [tree-sitter](https://github.com/tree-sitter/tree-sitter) can be also used,
however, there are more methods from ParserBase that need to be
changed. Check the implementation of [ParserBase](src/parser/parser.py).
A custom parser independent of tree-sitter can also be developed. For more details, refer to the implementation of [ParserBase](src/parser/parser.py).
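
As an illustration of a parser that does not rely on tree-sitter, the sketch below extracts Python identifiers with the standard library's `tokenize` module and drops the same ignored names as the `PythonParser` above (language keywords plus `self` and `cls`). It is a standalone function, not a `ParserBase` subclass, and the name `extract_identifiers` is hypothetical.

```python
import io
import keyword
import tokenize

# Same ignore list as the PythonParser example: keywords plus 'self' and 'cls'.
IGNORED = set(keyword.kwlist) | {"self", "cls"}


def extract_identifiers(source: str) -> list:
    """Return the identifiers in Python source, in order, without ignored names."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        tok.string
        for tok in tokens
        if tok.type == tokenize.NAME and tok.string not in IGNORED
    ]
```

Wiring this into AutoFL would still require overriding the relevant `ParserBase` methods, as noted above.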

## Know Issues
## Known Issues

- **Dependency Installation**: The setup process may take significant time (~10 minutes), and dependency installations might fail due to timeouts. This appears to be a network-related issue, and retrying often resolves it. Future updates will aim to simplify dependencies.
- **~~Indefinite Analysis Loops~~**: ~~In some projects, the analysis may loop indefinitely. This issue is currently under investigation.~~ This appears to be resolved in the latest version; we will monitor for further occurrences.

## Docker Image Availability

AutoFL is also available as a Docker image. You can pull the image from Docker Hub using:

```shell
docker pull cezarsas/autofl
```

- The installation of the dependencies requires quite some time (~10 minutes), and might fail due to a timeout.
Unfortunately, this issue is hard to reproduce, as it
seems to be related to the network connection. If you encounter this issue, please try again. Future versions will try
to fix this issue by
cleaning up the dependencies and reducing the number of dependencies.
- For some projects, the analysis might loop indefinitely. We are still investigating the cause of this issue.
Find more details and updates at the [Docker Hub page](https://hub.docker.com/r/cezarsas/autofl/).

## Disclaimer

The project is offered as is; it is still in development, and it might not work as expected in some cases.
It has been developed and tested on Docker 24.0.7 and 25.0.0 for ```Ubuntu 22.04```. While minor testing has been done
on ```Windows``` and ```MacOS```, not all functionalities might work due to differences in Docker for these OSs (e.g.
Windows uses WSL 2).
This tool is in active development and may not function as expected in some cases. It has been tested primarily on Docker versions `24.0.7` and `25.0.0` for `Ubuntu 22.04`. Limited testing has been performed on `Windows` and `MacOS`, where functionality may vary.

In case of any problems, please open an issue, make a pull request, or contact me at ```c.a.sas@rug.nl```.
If you encounter any issues, please open an issue on GitHub, make a pull request, or contact me at `c.a.sas@rug.nl`.

## Cite
## Citation

If you use this work please cite us:
If you find this tool useful, please cite our work:

### Paper

```text
```bibtex
@article{sas2024multigranular,
title = {Multi-granular Software Annotation using File-level Weak Labelling},
author = {Cezar Sas and Andrea Capiluppi},
journal = {Empirical Software Engineering},
volume = {29},
number = {1},
pages = {12},
year = {2024},
url = {https://doi.org/10.1007/s10664-023-10423-7},
doi = {10.1007/s10664-023-10423-7}
title = {Multi-granular Software Annotation using File-level Weak Labelling},
author = {Cezar Sas and Andrea Capiluppi},
journal = {Empirical Software Engineering},
volume = {29},
number = {1},
pages = {12},
year = {2024},
url = {https://doi.org/10.1007/s10664-023-10423-7},
doi = {10.1007/s10664-023-10423-7}
}
```

**Note**: The code used in the paper is available in
the [https://github.com/SasCezar/CodeGraphClassification](https://github.com/SasCezar/CodeGraphClassification)
repository.
However, this tool is more up to date, easier to use, more configurable, and also offers a UI.
**Note**: The code used in this paper is available at [CodeGraphClassification](https://github.com/SasCezar/CodeGraphClassification). However, AutoFL provides enhanced features, is more user-friendly, and includes a UI.

### Tool

```text
```bibtex
@software{sas2023autofl,
author = {Sas, Cezar and Capiluppi, Andrea},
month = dec,
month = oct,
title = {{AutoFL}},
url = {https://github.com/SasCezar/AutoFL},
version = {0.4.1},
year = {2023},
version = {0.5.0},
year = {2024},
url = {https://doi.org/10.5281/zenodo.10255368},
doi = {10.5281/zenodo.10255368}
}
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion config/main.yaml
@@ -1,7 +1,7 @@
# @package _global_
defaults:
- _self_
- local: docker
- environment: docker
- taxonomy: gitranking
- annotator: default
- version_strategy: latest
2 changes: 1 addition & 1 deletion config/runs.yaml
@@ -1,7 +1,7 @@
# @package _global_
defaults:
- _self_
- local: local
- environment: docker
- run: batch_annotation

package_annotation: True
2 changes: 1 addition & 1 deletion config/test.yaml
@@ -1,7 +1,7 @@
# @package _global_
defaults:
- _self_
- local: local
- environment: docker
- taxonomy: small
- annotator: default
- version_strategy: latest
Binary file removed data/grammars/languages.so
Binary file not shown.