
docs(2025): updated community bonding and week 1 documentation #310


Open · wants to merge 3 commits into `main`
48 changes: 47 additions & 1 deletion docs/2025/data-pipeline/index.md
@@ -2,4 +2,50 @@
sidebar_position: 2
title: Introduction
slug: /2025/data-pipeline/
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

## Author

[Abdulsobur Oyewale](https://github.com/smilingprogrammer)

## Contact info

- [Email](mailto:oyewaleabdulsobur@gmail.com)

## Project title

Data Pipelining For Safaa

## What's the project about?

Currently, Safaa provides a strong framework for dealing with copyright notices, particularly focusing on the identification and reduction of false positives, as well as streamlining the decluttering procedure to remove unnecessary content. Key features of Safaa include:
1. Model Flexibility
2. Integration with scikit-learn
3. spaCy Integration
4. Preprocessing Tools

However, data in the Safaa project is currently curated manually, and most steps of the workflow are manual.
This project will concentrate on creating an automated pipeline, utilizing LLMs where required to increase accuracy, or applying deep learning techniques to improve the models.

This includes writing scripts to automatically copy copyright data (a group's data or selected users' data) from a FOSSology instance to train the model.


## What should be done?

Here are the key tasks planned for the project:

1. Create scripts to fetch the copyright data from the FOSSology server's copyright table (localhost)
2. Clean and preprocess the fetched copyright data (utilizing prewritten processing functions)
   - Preprocessed data should have a label and clean text.
3. Split the data into training/validation/test sets.
4. Train the false-positive model as well as the declutter model (utilizing prewritten training functions)
5. Model evaluation (checking precision, recall, etc.)
6. Model versioning and release.
7. Should work for both GitLab and GitHub.
   - Manual trigger.
   - Should also be able to run as a cron job.
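The task list above can be sketched end to end as follows. This is a hypothetical outline with placeholder function bodies and invented names, not Safaa's actual code; the real pipeline would query the FOSSology database and call Safaa's prewritten helpers:

```python
"""Hypothetical end-to-end sketch of the planned pipeline (tasks 1-3)."""
import random


def fetch_copyright_rows():
    # Task 1 (placeholder): in practice, query the FOSSology copyright table.
    return [{"text": f"Copyright (c) 202{i} Example Corp", "label": 1}
            for i in range(10)]


def preprocess(rows):
    # Task 2 (placeholder): each record keeps a cleaned text and a label.
    return [{"text": r["text"].strip().lower(), "label": r["label"]}
            for r in rows]


def split(rows, train=0.8, val=0.1, seed=42):
    # Task 3: deterministic train/validation/test split.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    a, b = int(n * train), int(n * (train + val))
    return shuffled[:a], shuffled[a:b], shuffled[b:]


if __name__ == "__main__":
    tr, va, te = split(preprocess(fetch_copyright_rows()))
    print(len(tr), len(va), len(te))
```

Tasks 4-6 (training, evaluation, versioning) would then consume these splits via Safaa's existing training functions.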
42 changes: 42 additions & 0 deletions docs/2025/data-pipeline/updates/2025-05-30.md
@@ -0,0 +1,42 @@
---
title: Community bonding
author: Abdulsobur Oyewale
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->
# **Meeting Summary for GSoC Community Bonding Period**

# Introduction Meeting
*(May 29, 2025)*

This was the inaugural meeting of the community bonding period for GSoC 2025.
* A general introduction of mentors and contributors took place.
* We were given an introduction to the FOSSology community.
* The time and platform for the weekly general meeting were discussed.
* We also discussed the expectations for the GSoC program.
* The mentors also emphasized the importance of communication in open-source projects.
* At the end there was a Q&A session to address any queries we had.

# Personal Meeting With The Mentors
*(May 30, 2025)*

* They emphasized the importance of documentation in this project.
* I was encouraged to post regular updates.
* We discussed the project, its targets, and expectations.
* We also discussed timings for the weekly technical calls but didn't finalize them, since one of the mentors wasn't on the call.
* There was also a discussion with my mentor on reviewing last year's work.
* We also discussed adding my documentation to the FOSSology GSoC page and submitting a pull request.
* Lastly, I discussed with the mentors how to start the coding period: installing FOSSology locally and running different tests to understand how it works.


### Engagements

* Explored FOSSology's local setup and installation process
* Reviewed some crucial pipeline requirements essential for Safaa's automation efforts.


**This report summarizes my activities and interactions during the GSoC community bonding period.**
31 changes: 31 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-04.md
@@ -0,0 +1,31 @@
---
title: Week 1
author: Abdulsobur Oyewale
tags: [gsoc25, Safaa Data for Pipeline]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 1
*(June 4, 2025)*

## Attendees:
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)

### Engagements
* I installed FOSSology locally and overcame the obstacle of working on Windows: since the FOSSology installation guide works best with Linux, I completed the installation with WSL2.
* I also ran various examples against the Safaa agent to test its features and functionality, which gave me insight into how it currently works. You can find this here.

## Discussion:
* I described how I installed FOSSology using the link my mentors provided and how I familiarized myself with its features.
* Furthermore, I discussed the tests I ran with Safaa's current copyright detection agent and my experiments with the false-positive deactivation agent, assessing its features and functionality through examples.
* Lastly, Safaa's performance was critically evaluated, and strategies for acquiring data for my copyright script were discussed with me.


## Subsequent Steps
* I was tasked to begin with the first item in the project list: creating a script to fetch copyright data from a FOSSology instance.
37 changes: 37 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-11.md
@@ -0,0 +1,37 @@
---
title: Week 2
author: Abdulsobur Oyewale
tags: [gsoc25, Safaa Data for Pipeline]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 2
*(June 11, 2025)*

## Attendees:
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)
- [Kaushlendra Pratap](https://github.com/Kaushl2208)

### Engagements
* This week I started full engagement with this year's project. The first task toward this goal is creating a script to fetch copyright content from the FOSSology server.
* I started by writing SQL queries to fetch this content from the FOSSology server, and after some tweaking I achieved this goal.
* With the SQL query working, I wrote a Python program that embeds it, using the psycopg2 library to connect to the PostgreSQL database server.
* With this, I was able to automate the collection of copyright content data from the FOSSology server running on localhost.
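A minimal sketch of what such a fetch script might look like, assuming a `copyright` table with a `content` column (FOSSology's actual schema and credentials may differ) and the psycopg2 driver:

```python
def build_query(limit=None):
    """Build the fetch query. Assumed schema: a `copyright` table
    with a `content` column (may differ per FOSSology version)."""
    sql = "SELECT content FROM copyright WHERE content IS NOT NULL"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql + ";"


def fetch_copyright_content(limit=None, **conn_kwargs):
    # Driver imported here so query-building stays usable without it;
    # install with `pip install psycopg2-binary`.
    import psycopg2

    # Placeholder connection defaults; override via conn_kwargs.
    params = {"dbname": "fossology", "user": "fossy",
              "password": "fossy", "host": "localhost", **conn_kwargs}
    conn = psycopg2.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute(build_query(limit))
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
```

Keeping the SQL in a separate `build_query` helper makes the query testable without a live database connection.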


## Meeting Discussion:
* I discussed the week's progress with the mentors and how the project is going, including any obstacles.
* We discussed the current progress: the content-fetching script for the local FOSSology server.
* I also gave them a demo showing how the script works and its expected output.


## Subsequent Steps
* I was tasked to include a timestamp with the generated data, so we can track the sequence of data updates.
* I was also told to make changes so the script accommodates various server configurations by placing the server configuration in a `.env` file.
* I will also continue with the preprocessing script, which will let us preprocess the data fetched from the FOSSology server.
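A sketch of the two requested changes, assuming illustrative variable names like `FOSS_DB_HOST` (the real names may differ); a `.env` file would populate these environment variables before the script runs:

```python
import os
from datetime import datetime, timezone


def db_config():
    # Read server settings from the environment (e.g. loaded from a
    # .env file); names and defaults here are placeholders.
    return {
        "host": os.environ.get("FOSS_DB_HOST", "localhost"),
        "dbname": os.environ.get("FOSS_DB_NAME", "fossology"),
        "user": os.environ.get("FOSS_DB_USER", "fossy"),
    }


def stamp(rows):
    # Attach a UTC fetch timestamp to every record so the sequence of
    # data updates can be tracked.
    ts = datetime.now(timezone.utc).isoformat()
    return [{"content": r, "fetched_at": ts} for r in rows]
```

A library such as python-dotenv could load the `.env` file into `os.environ` at startup.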
36 changes: 36 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-18.md
@@ -0,0 +1,36 @@
---
title: Week 3
author: Abdulsobur Oyewale
tags: [gsoc25, Safaa Data for Pipeline]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 3
*(June 18, 2025)*

## Attendees:
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)
- [Kaushlendra Pratap](https://github.com/Kaushl2208)

### Engagements
* This week I began with the second task on the list: creating a script to preprocess copyright content from the FOSSology server.
* Last week I was informed of a pre-written script in the Safaa codebase that I could use to complete this task faster.
* I began by reading through this pre-written script and understanding it before modifying it to suit our intent.
* With this, I was able to preprocess the data we retrieved from the FOSSology server running on localhost.
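As a minimal illustration of the intended output shape (clean text plus a label); Safaa's prewritten preprocessing functions are more thorough, and the cleaning steps here are only representative:

```python
import re


def clean_text(text):
    """Representative cleaning steps, not Safaa's actual pipeline."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # drop stray markup tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text


def to_example(raw, label):
    # Each preprocessed record carries a clean text and a label,
    # as required by the project task list.
    return {"text": clean_text(raw), "label": label}
```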


## Meeting Discussion:
* I discussed the week's progress with the mentors and how the project is going, including any obstacles.
* We discussed the current progress: preprocessing the data fetched from the local FOSSology server using the available pre-written script.
* I also gave them a demo showing how the script works and its expected output.
* I was told the task needs to be modified so that it can be triggered using GitHub Actions rather than run manually.


## Subsequent Steps
* Given that we already have working preprocessing, I was tasked to modify it to be triggered with GitHub Actions.
* I will continue with this task next week.
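One possible shape for such a trigger is a GitHub Actions workflow combining a manual `workflow_dispatch` trigger with a cron schedule, matching the project task list; the workflow name, file paths, script name, and schedule below are all placeholders:

```yaml
# Hypothetical workflow sketch; adjust names and paths to the real repo.
name: safaa-data-pipeline
on:
  workflow_dispatch:        # manual trigger
  schedule:
    - cron: "0 2 * * 0"     # weekly scheduled run (Sundays, 02:00 UTC)
jobs:
  preprocess:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/preprocess.py   # placeholder script path
```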
4 changes: 4 additions & 0 deletions docs/2025/data-pipeline/updates/_category_.json
@@ -0,0 +1,4 @@
{
"label": "Weekly Updates",
"position": 2
}