Merge pull request #20 from sede-open/develop
Develop to Main
BensGitShell authored Jan 13, 2025
2 parents 65d8cf6 + 5ad810c commit 60ba55a
Showing 17 changed files with 935 additions and 31 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -76,6 +76,7 @@ coverage.xml
*.py,cover
.hypothesis/
.pytest_cache/
*.pem


# Translations
111 changes: 94 additions & 17 deletions README.md
@@ -1,14 +1,14 @@
# Fleming

An open-source project of the "brain" of the AI Discovery Tool. Including techincal scripts to build, register, serve and query models on databricks which use Semantic Search. These models can be run on cpu and not gpu providing signiifcant cost reductions.
<img align="right" src="docs/images/logo.png" title="Logo Discovery" alt="Logo Discovery" width="33%">

Databricks, a popular big data processing and analytics platform, is utilized to build and train machine learning models on the ingested data.
An open-source project of the "brain" of the AI Discovery Tool, including technical scripts to build, register, serve and query models on Databricks which use Semantic Search. These models can be run on CPU rather than GPU, providing significant cost reductions.

By combining data ingestion from GitHub with Databricks' model training and serving capabilities, pipelines can provide a seamless end-to-end solution for processing and analyzing data from GitHub repositories.
[Databricks](https://www.databricks.com), a popular big data processing and analytics platform, is utilized to build and train machine learning models on the ingested data.

The serving endpont designed to process and analyze large volumes of data, enabling efficient data discovery and insights.
By combining data ingestion from GitHub with Databricks' model training and serving capabilities, pipelines can provide a seamless end-to-end solution for processing and analyzing data from GitHub repositories.

![Logo Discovery](docs/images/logo.png)
The serving endpoint is designed to process and analyze large volumes of data, enabling efficient data discovery and insights.

## Corpus Creation

@@ -20,9 +20,7 @@ The class contains the following methods:

2. write_corpus_to_file: Write the corpus to a file from the concatenated columns.


```python

from fleming.discovery.corpus_creation import CorpusCreation
from pyspark.sql import SparkSession

@@ -41,8 +39,7 @@ corpus_creation.write_corpus_to_file(corpus)

A class to train and register a semantic search model.

```python

```python
from fleming.discovery.model_train_register import ModelTrainRegister, SemanticSearchModel
from pyspark.sql import SparkSession

@@ -64,7 +61,7 @@ model_developer.embed_corpus()

# Define parameters and artifacts
parameters = {"top_k": 50, "relevancy_score": 0.45}
input_example = ["Innersource best practices"]
input_example = ["InnerSource best practices"]
test_output = ["match 1", "match 2"]
signature = infer_signature(input_example, test_output, params=parameters)
artifacts = {
@@ -82,10 +79,8 @@ model_developer.create_registered_model(unique_model_name, input_example, signat
## Model Serving

This class creates a model serving endpoint on databricks.


```python

```python
from fleming.discovery.corpus_creation import CorpusCreation
from pyspark.sql import SparkSession

@@ -130,7 +125,6 @@ A class which allows for querying a model serving endpoint on databricks.
This class is used to query a model serving endpoint on databricks with a dataset.

```python

url = "https://example.com/model_endpoint"
token = "your_auth_token"

@@ -146,8 +140,60 @@ try:
print(response)
except requests.exceptions.HTTPError as e:
print(f"Error: {str(e)}")
```
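
A minimal end-to-end sketch of sending such a query with the `requests` library is shown below. The payload shape (`inputs` plus `params`) and the example URL are assumptions; match them to whatever your serving endpoint actually expects.

```python
import requests

# Hypothetical endpoint and token; replace with your serving endpoint details.
url = "https://example.com/model_endpoint"
token = "your_auth_token"

headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}

# Assumed payload shape: a list of natural-language queries plus optional parameters.
payload = {
    "inputs": ["InnerSource best practices"],
    "params": {"top_k": 50, "relevancy_score": 0.45},
}

try:
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    print(response.json())
except requests.exceptions.HTTPError as e:
    print(f"Error: {str(e)}")
```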

## GitHub Repository Extraction and Tokenisation

This class ingests the files from a list of repositories in an organization and processes the contents to return a DataFrame of the concatenated contents.

```python

from fleming.code_summary.repo_contents_tokenisation import GitHubRepoDataProcessor
from pyspark.sql import SparkSession

# Not required if using Databricks
spark = SparkSession.builder.appName("RepoConcat").getOrCreate()

organization_name = 'company-x'
repo_list = ['repo1', 'repo2', 'repo3']
num_token_per_repo = 100000
pem_key = 'xxxxx'
pem_file = '/dbfs/FileStore/github_app/pem_key.pem'

github_repo_data_processor = GitHubRepoDataProcessor(spark, organization_name, repo_list, num_token_per_repo, pem_key, pem_file)
repo_contents_df = github_repo_data_processor.data_collection()
repo_contents_df_concat = github_repo_data_processor.concatenate_repo_contents(repo_contents_df)

```
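Before the concatenated contents are passed to the summarisation step below, they can be persisted for later use. A minimal sketch, assuming `repo_contents_df_concat` is a standard Spark DataFrame and using a hypothetical table name:

```python
# Hypothetical table name; replace with a table or location in your own workspace.
output_table = "discovery.repo_contents_concatenated"

# Persist the concatenated repository contents so the summarisation step can read them back.
repo_contents_df_concat.write.format("delta").mode("overwrite").saveAsTable(output_table)
```
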
## Code Summarisation

This class authenticates to the Azure OpenAI API and passes the concatenated repository contents into the LLM with an accompanying prompt to generate documentation.

```python
from fleming.code_summary.fourO_mini_summary import OpenAIClient
from pyspark.sql import SparkSession

# Not required if using Databricks
spark = SparkSession.builder.appName("openai_client").getOrCreate()

spark_input_df = "your_spark_input_df"
output_table_name = "your_output_table"

prompt = "The following code is the contents of a repository, generate a short summary paragraph describing what the repository purpose is. A paragraph detailing the key functionalities and technologies integrate with and a list of key words associated with this repository underneath. Focus on the purpose of the code contained in the repository, and the technologies, data and platforms it integrates with"

api_key = "your_api_key"
endpoint = "https://api.openai.com/yourendpointhere"

headers = {
"Content-Type": "application/json",
"api-key": api_key,
}

client = OpenAIClient(spark, spark_input_df, output_table_name, prompt, api_key, endpoint, headers)

client.call_openai()
```
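
Assuming `call_openai` persists the generated summaries to the table named by `output_table_name` (a placeholder in the example above), the results can then be inspected with standard Spark calls:

```python
# Read back the generated summaries; assumes the output table is registered in your metastore.
summaries_df = spark.table(output_table_name)
summaries_df.show(truncate=False)
```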


# Repository Guidelines

@@ -176,25 +222,53 @@ except requests.exceptions.HTTPError as e:

# Developer Guide - How to Use

## Prerequisite

### Python

There are a few things to note before using Fleming. The following prerequisites will need to be installed on your local machine.

Python version >= 3.9 and < 3.12 should be installed. Check which Python version you have with the following command:

```
python --version
```

Find the latest Python version [here](https://www.python.org/downloads/) and ensure your Python path is set up correctly on your machine.
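
As a quick programmatic check of the range above:

```python
import sys

# Fleming supports Python >= 3.9 and < 3.12 (the range stated above).
if not ((3, 9) <= sys.version_info[:2] < (3, 12)):
    raise RuntimeError(f"Unsupported Python version: {sys.version.split()[0]}")
print(f"Python {sys.version.split()[0]} is within the supported range")
```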

### Python Package Installers

Installing Fleming can be done using the package installer [Micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html).

### Java

To use Fleming in your own environment that leverages [pyspark](https://spark.apache.org/docs/latest/api/python/getting_started/install.html), Java 8 or later is a [prerequisite](https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies). See below for suggestions to install Java in your development environment.

Follow the official Java JDK installation documentation [here.](https://docs.oracle.com/en/java/javase/11/install/overview-jdk-installation.html)

- [Windows](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-microsoft-windows-platforms.html)
- [Mac OS](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-macos.html)
- [Linux](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-linux-platforms.html)

!!! note
Windows requires an additional installation of a file called **winutils.exe**. Please see this [repo](https://github.com/steveloughran/winutils) for more information.

## Getting Started

1) To get started with developing for this project, clone the repository.
```
git clone https://github.com/sede-open/Fleming.git
```
2) Open the respository in VS Code, Visual Studio or your preferered code editor.
2) Open the repository in VS Code, Visual Studio or your preferred code editor.

3) Create a new environment using the following command:
```
micromamba create -f environment.yml
```

> **_NOTE:_** You will need to have conda, python and pip installed to use the command above.
4) Activate your newly set up environment using the following command:
```
micromamba activate
micromamba activate fleming
```
You are now ready to start developing your own functions. Please remember to follow Fleming's development lifecycle to maintain clarity and efficiency for a fully robust, self-serving platform.

@@ -232,10 +306,13 @@ You are now ready to start developing your own functions. Please remember to fol
> **_NOTE:_** Ensure you have read the [Release Guidelines](RELEASE.md) before publishing your code.
# Support and contacts

If you encounter any issues or have questions, please reach out to the team by raising an issue on the repo. They will be happy to assist you and provide further information about the project.

# Contributing

Contributions to this project are welcome! If you would like to contribute, please refer to our [Contributing Guide](CONTRIBUTION.md) for guidelines on how to get started. We appreciate your support in making this project even better.

# Licensing

The code in this repository is licensed under the default copyright notice, which can be found in the [LICENSE](LICENSE) file. Please review the license before using or distributing the code.
6 changes: 5 additions & 1 deletion docs/blog/.authors.yml
@@ -20,4 +20,8 @@ authors:
GBCDFAS:
name: Christian DeFeo
description: Contributor
avatar: https://github.com/doctorcdf27.png
avatar: https://github.com/doctorcdf27.png
GBBBAS:
name: Ben Butcher
description: Contributor
avatar: https://github.com/bensgitshell.png
28 changes: 28 additions & 0 deletions docs/blog/posts/code-summarisation.md
@@ -0,0 +1,28 @@
---
date: 2025-01-13
authors:
- GBBBAS
---

# Introducing Project Fleming: Enhancing Code Reuse and Efficiency with AI Discovery

<center>
![code_summary](../images/code_summary.png){width=75%}
</center>


We are delighted to announce the release of an enhancement to Project Fleming: the addition of the Code Summary functionality. Once code has been summarised, the summaries can be passed into Project Fleming's corpus to improve search for specific code functionality.

<!-- more -->

## Code Summary

The purpose of the Code Summary package is to improve a user's search for explicit code functionality by utilising LLMs to ingest the source code of repositories and generate comprehensive, descriptive documentation that can then be indexed by Project Fleming, consequently improving the quality of results.

The initial use-case of Project Fleming was to ingest repository readme contents into the search index, thereby allowing users to perform natural language searches based on the information provided in the readme file by the developers. However, this was inherently limited by the fact that it relied on the developer producing a descriptive readme. Furthermore, in some cases developers are unable to write a full functional description of all their code within a larger repository, as they are (reasonably) trying to give a high-level overview of the entire project rather than detailing specific code functionality.

Enter Project Fleming's Code Summary package, which can be tailored to generate descriptive code-functionality documentation for repositories, thus increasing the likelihood that specific code relevant to the user's query is found when this documentation is ingested into the search index of Project Fleming. This use of LLMs is designed to allow users to find reusable code that may previously have been hidden by the domain context it sits in, as discovery is no longer reliant on the specific functionality being described in the readme.

## Further applications

We see this as having multiple applications: not only enhancing the findability of reusable code, but also helping organisations with large-scale code scanning to find specific types of projects. For example, the Code Summary package could be used to identify AI projects, which could then be assessed for compliance with local regulations. This is only one of many potential applications, and as always, we actively welcome feedback and contributions to enhance Project Fleming further.
1 change: 1 addition & 0 deletions docs/code-reference/4o-MiniSummary.md
@@ -0,0 +1 @@
::: src.fleming.code_summary.fourO_mini_summary
1 change: 1 addition & 0 deletions docs/code-reference/RepoContentsTokenisation.md
@@ -0,0 +1 @@
::: src.fleming.code_summary.repo_contents_tokenisation
34 changes: 34 additions & 0 deletions docs/getting-started/code-summary-how-to.md
@@ -0,0 +1,34 @@
# Code Summarisation Using LLM for Search Index Enhancement

## Documentation
Once you have extracted and concatenated the contents of a repository, this class can then be used to generate descriptive documentation using Azure OpenAI. Please note that this class requires a working Azure OpenAI API key.

This class authenticates to the API and passes in the concatenated repository content with a prompt of your choosing (an example is provided below). The output is descriptive documentation that identifies code functionality, which can then be indexed by Fleming, allowing users to find specific code functionality that may not have been sufficiently detailed in the readme or that is hidden by a domain context.

For more information about options within the Class please follow the documentation under the [code-reference](../code-reference/4o-MiniSummary.md) section.

## Example

```python
from fleming.code_summary.fourO_mini_summary import OpenAIClient
from pyspark.sql import SparkSession

# Not required if using Databricks
spark = SparkSession.builder.appName("openai_client").getOrCreate()

spark_input_df = "your_spark_input_df"
output_table_name = "your_output_table"

prompt = "The following code is the contents of a repository, generate a short summary paragraph describing what the repository purpose is. A paragraph detailing the key functionalities and technologies integrate with and a list of key words associated with this repository underneath. Focus on the purpose of the code contained in the repository, and the technologies, data and platforms it integrates with"

api_key = "your_api_key"
endpoint = "https://api.openai.com/yourendpointhere"

headers = {
"Content-Type": "application/json",
"api-key": api_key,
}

client = OpenAIClient(spark, spark_input_df, output_table_name, prompt, api_key, endpoint, headers)
client.call_openai()
```
36 changes: 34 additions & 2 deletions docs/getting-started/installation.md
@@ -1,4 +1,36 @@
# Installation
# Installation Process

## Prerequisite

### Python

There are a few things to note before using Fleming. The following prerequisites will need to be installed on your local machine.

Python version >= 3.9 and < 3.12 should be installed. Check which Python version you have with the following command:

```
python --version
```

Find the latest Python version [here](https://www.python.org/downloads/) and ensure your Python path is set up correctly on your machine.

### Python Package Installers

Installing Fleming can be done using the package installer [Micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html).

### Java

To use Fleming in your own environment that leverages [pyspark](https://spark.apache.org/docs/latest/api/python/getting_started/install.html), Java 8 or later is a [prerequisite](https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies). See below for suggestions to install Java in your development environment.

Follow the official Java JDK installation documentation [here.](https://docs.oracle.com/en/java/javase/11/install/overview-jdk-installation.html)

- [Windows](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-microsoft-windows-platforms.html)
- [Mac OS](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-macos.html)
- [Linux](https://docs.oracle.com/en/java/javase/11/install/installation-jdk-linux-platforms.html)

!!! note
Windows requires an additional installation of a file called **winutils.exe**. Please see this [repo](https://github.com/steveloughran/winutils) for more information.


## Installation

1) To get started with developing for this project, clone the repository.
```
@@ -16,7 +48,7 @@
4) Activate your newly set up environment using the following command:
```
micromamba activate
micromamba activate fleming
```
You are now ready to start developing your own functions. Please remember to follow Fleming's development lifecycle to maintain clarity and efficiency for a fully robust, self-serving platform.

34 changes: 34 additions & 0 deletions docs/getting-started/tokenisation-of-git-repo.md
@@ -0,0 +1,34 @@

# Extracting GitHub Repo Contents and Concatenating for Summarisation

## Documentation

To prepare a GitHub repository to be summarised, it is important to extract all relevant files and concatenate the information. This class ingests the files from a list of repositories in an organization and processes the contents to return a DataFrame of the concatenated contents.

Different summarisation tools have different limits on the number of tokens that can be ingested; the class below allows you to set this limit for each repository ingested.

Please find an example below.

For more information about options within the Class please follow the documentation under the [code-reference](../code-reference/RepoContentsTokenisation.md) section.

## Example

```python

from fleming.code_summary.repo_contents_tokenisation import GitHubRepoDataProcessor
from pyspark.sql import SparkSession

# Not required if using Databricks
spark = SparkSession.builder.appName("RepoConcat").getOrCreate()

organization_name = 'company-x'
repo_list = ['repo1', 'repo2', 'repo3']
num_token_per_repo = 100000
pem_key = 'xxxxx'
pem_file = '/dbfs/FileStore/github_app/pem_key.pem'

github_repo_data_processor = GitHubRepoDataProcessor(spark, organization_name, repo_list, num_token_per_repo, pem_key, pem_file)
repo_contents_df = github_repo_data_processor.data_collection()
repo_contents_df_concat = github_repo_data_processor.concatenate_repo_contents(repo_contents_df)

```
10 changes: 8 additions & 2 deletions environment.yml
@@ -31,9 +31,15 @@ dependencies:
- mkdocstrings-python==1.10.8
- mkdocs-macros-plugin==1.0.1
- openjdk=11
- tiktoken==0.8.0
- time>=1.0.0
- pip:
- databricks-sdk>=0.20.0,<1.0.0
- torch>=2.4.1
- pyspark>=3.3.0,<3.6.0
- pyspark>=3.4.0,<3.6.0
- sentence-transformers>=3.2.0
- mlflow>=2.0.1
- mlflow>=2.0.1
- beautifulsoup4==4.12.3
- PyGithub==2.5.0
- jwt==1.3.1

