-
Notifications
You must be signed in to change notification settings - Fork 4
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #20 from sede-open/develop
Develop to Main
- Loading branch information
Showing
17 changed files
with
935 additions
and
31 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -76,6 +76,7 @@ coverage.xml | |
*.py,cover | ||
.hypothesis/ | ||
.pytest_cache/ | ||
*.pem | ||
|
||
|
||
# Translations | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
--- | ||
date: 2025-01-13 | ||
authors: | ||
- GBBBAS | ||
--- | ||
|
||
# Introducing Project Fleming: Enhancing Code Reuse and Efficiency with AI Discovery | ||
|
||
<center> | ||
![code_summary](../images/code_summary.png){width=75%} | ||
</center> | ||
|
||
|
||
We are delighted to announce the release of an enhancement to Project Fleming, with the addition of the Code Summary functionality. Once code has been summarised, this can be passed into Project Fleming's corpus to improve search for specific code functionality. | ||
|
||
<!-- more --> | ||
|
||
## Code Summary | ||
|
||
The purpose of the Code Summary package is to improve a user's search for explicit code functionality, by utilising LLMs to ingest source code of repositories and generate comprehensive and descriptive documentation, that can then be indexed by Project Fleming and consequently improving the quality of results. | ||
|
||
The initial use-case of Project Fleming was to ingest repository readme contents into the search index, thereby allowing users to perform natural language searches, based off the information provided in the readme file by the developers. However, this was inherently limited by the fact it was reliant on the developer producing a descriptive readme. Furthermore, in some cases developers are unable to write a full functional description of all their code within a larger repository, as they are (reasonably) trying to give a high level overview of the entire project, instead of detailing specific code functionality. | ||
|
||
Enter Project Fleming's Code Summary Package, which can be tailored to generate descriptive code functionality documentation for repositories, thus increasing the likelihood that specific code relevant to the users query is found when this documentation is ingested into the search index of Project Fleming. This use of LLMs is designed to allow users to find reusable code, that may have been previously hidden by the domain context it is situated in, as it is no longer reliant on the specific functionality being described in the readme. | ||
|
||
## Further applications | ||
|
||
We see this as having multiple applications, not only to enhance the findability of reusable code, but also to help organisations with large-scale code scanning to find specific types of projects. For example, the Code Summary package could be used to identify AI projects, which could then be assessed to see if they are compliant with the local regulations. This is only one of many potential applications, and as always, we actively welcome feedback and further contributions to further enhance Project Fleming. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
::: src.fleming.code_summary.fourO_mini_summary |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
::: src.fleming.code_summary.repo_contents_tokenisation |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Code Summarisation Using LLM for Search Index Enhancement | ||
|
||
# Documentation | ||
Once you have extracted and concatenated the contents of a repository, this class can be used to then generate descriptive documentation using Azure OpenAI. Please note, this class requires you to have a working Azure OpenAI API Key. | ||
|
||
This class then authenticates to the API, and passes in the concatenated repository content with a prompt of your choosing (an example has been provided below). The output is descriptive documentation that identifies code functionality, which can then be indexed by Fleming, to allow users to find specific code functionality that may not have been sufficiently detailed in the readme, or hidden by a domain context. | ||
|
||
For more information about options within the Class please follow the documentation under the [code-reference](../code-reference/4o-MiniSummary.md) section. | ||
|
||
# Example | ||
|
||
```python | ||
from fleming.code_summary.fourO_mini_summary import call_openai | ||
from pyspark.sql import SparkSession | ||
|
||
# Not required if using Databricks | ||
spark = SparkSession.builder.appName("openai_client").getOrCreate() | ||
|
||
spark_input_df = "your_spark_input_df" | ||
output_table_name = "your_output_table" | ||
|
||
prompt = "The following code is the contents of a repository, generate a short summary paragraph describing what the repository purpose is. A paragraph detailing the key functionalities and technologies integrate with and a list of key words associated with this repository underneath. Focus on the purpose of the code contained in the repository, and the technologies, data and platforms it integrates with" | ||
|
||
api_key = "your_api_key" | ||
endpoint = "https://api.openai.com/yourendpointhere" | ||
|
||
headers = { | ||
"Content-Type": "application/json", | ||
"api-key": api_key, | ||
} | ||
|
||
client = OpenAIClient(spark, delta_table, output_table_name, prompt, api_key, endpoint, headers) | ||
client.call_openai() | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
|
||
# Extracting GitHub Repo Contents and Concatenating for Summarisation | ||
|
||
# Documentation | ||
|
||
To prepare a github repository to be summarised it is important to extract all relevant files and concantenate the information. This class ingests the files from a list of repositories in an organization and processes the contents to return a dataframe of the contents concatenated. | ||
|
||
Different summerisation tools will have different limits on the number of tokens which can be ingested, the below class allows you to limit this for each repository ingested. | ||
|
||
Please find an example below. | ||
|
||
For more information about options within the Class please follow the documentation under the [code-reference](../code-reference/RepoContentsTokenisation.md) section. | ||
|
||
# Example | ||
|
||
```python | ||
|
||
from fleming.code_summary.repo_contents_tokenisation import GitHubRepoDataProcessor | ||
from pyspark.sql import SparkSession | ||
|
||
# Not required if using Databricks | ||
spark = SparkSession.builder.appName("RepoConcat").getOrCreate() | ||
|
||
organization_name = 'company-x' | ||
repo_list = ['repo1', 'repo2', 'repo3'] | ||
num_token_per_repo = 100000 | ||
pem_key = 'xxxxx' | ||
pem_file = '/dbfs/FileStore/github_app/pem_key.pem' | ||
|
||
github_repo_data_processor = GitHubRepoDataProcessor(spark, organization_name, repo_list, num_token_per_repo, pem_key, pem_file) | ||
repo_contents_df = github_repo_data_processor.data_collection() | ||
repo_contents_df_concat = github_repo_data_processor.concatenate_repo_contents(repo_contents_df) | ||
|
||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.