Data Pipeline for Safaa weekly update/report. #308

48 changes: 47 additions & 1 deletion docs/2025/data-pipeline/index.md
@@ -2,4 +2,50 @@
sidebar_position: 2
title: Introduction
slug: /2025/data-pipeline/
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

## Author

[Abdulsobur Oyewale](https://github.com/smilingprogrammer)

## Contact info

- [Email](mailto:oyewaleabdulsobur@gmail.com)

## Project title

Data Pipelining For Safaa

## What's the project about?

Currently, Safaa provides a strong framework for dealing with copyright notices, focusing in particular on identifying and reducing false positives and on streamlining the decluttering procedure that removes unnecessary content. Key features of Safaa include:
1. Model Flexibility
2. Integration with scikit-learn
3. spaCy Integration
4. Preprocessing Tools
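As a rough illustration of what the preprocessing step might involve, a minimal cleaning function for raw copyright text could look like the following. The specific rules here (stripping stray markup, non-printable bytes, and redundant whitespace) are assumptions for illustration, not Safaa's actual implementation:

```python
# Illustrative sketch of cleaning a raw copyright notice before it is
# fed to a model. The rules below are assumptions, not Safaa's code.
import re

def clean_notice(text: str) -> str:
    """Remove common scan debris from a raw copyright notice."""
    text = re.sub(r"<[^>]+>", " ", text)       # stray HTML tags
    text = re.sub(r"[^\x20-\x7e]", " ", text)  # non-printable bytes
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip()

print(clean_notice("Copyright &copy; <b>2020</b>\x00  Example  Corp."))
```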

However, data in the Safaa project is currently curated manually, and much of the surrounding workflow is manual as well.
This project will concentrate on creating a pipeline, utilizing LLMs if required to increase accuracy, or applying deep learning techniques for improvement.

Writing scripts to automatically copy copyright data (a group's data or selected users' data) from a FOSSology instance to train the model.
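A first sketch of that fetch step might look like the following. It uses an in-memory SQLite database as a stand-in for the FOSSology PostgreSQL instance, and the table and column names (`copyright`, `content`, `is_enabled`) are assumptions for illustration; the real script would query the actual FOSSology schema, e.g. via psycopg2:

```python
# Sketch of the fetch step: query a "copyright" table and dump rows to
# CSV for training. SQLite stands in for FOSSology's PostgreSQL here,
# and the schema below is an assumption, not FOSSology's real one.
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE copyright (content TEXT, is_enabled INTEGER)")
conn.executemany(
    "INSERT INTO copyright VALUES (?, ?)",
    [
        ("Copyright (c) 2020 Example Corp.", 1),
        ("all rights reserved to nobody", 0),
    ],
)

rows = conn.execute("SELECT content, is_enabled FROM copyright").fetchall()

with open("copyrights.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    writer.writerows(rows)

print(f"exported {len(rows)} rows")
```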


## What should be done?

Here are the key tasks planned for the project:

1. Create scripts to fetch copyright data from the FOSSology server's copyright table (localhost).
2. Clean and preprocess the fetched copyright data (utilizing prewritten preprocessing functions).
   - Preprocessed data should contain a label and clean text.
3. Split the data into training/validation/test sets.
4. Train the false-positive model as well as the declutter model (utilizing prewritten training functions).
5. Evaluate the models (checking precision, recall, etc.).
6. Version and release the models.
7. The pipeline should work on both GitLab and GitHub.
   - Manual trigger.
   - Should also be able to run as a cron job.
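The splitting, training, and evaluation steps above could be sketched end to end with scikit-learn, which Safaa already integrates with. The tiny inline dataset and the plain TF-IDF plus logistic-regression model are placeholders for the real fetched data and Safaa's prewritten training functions:

```python
# Hedged sketch of split/train/evaluate; the data and model below are
# placeholders, not Safaa's actual training code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

texts = [
    "Copyright (c) 2019 Acme Inc.",
    "Copyright 2020 John Smith",
    "(c) 2018 Widgets Ltd. All rights reserved.",
    "Copyright (C) 2021 Example Project contributors",
    "see the license file for details",
    "this notice may not be removed",
    "permission is hereby granted free of charge",
    "redistribution and use in source and binary forms",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = genuine notice, 0 = false positive

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

vec = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vec.fit_transform(X_train), y_train)

pred = model.predict(vec.transform(X_test))
print("precision:", precision_score(y_test, pred, zero_division=0))
print("recall:", recall_score(y_test, pred, zero_division=0))
```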
42 changes: 42 additions & 0 deletions docs/2025/data-pipeline/updates/2025-05-30.md
@@ -0,0 +1,42 @@
---
title: Community bonding
author: Abdulsobur Oyewale
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->
# **Meeting Summary for GSoC Community Bonding Period**

# Introduction Meeting
*(May 29, 2025)*

This was the inaugural meeting of the community bonding period for GSoC 2025.
* A general introduction of mentors and contributors took place.
* We were given an introduction to the FOSSology community.
* The time and platform for the weekly general meeting were discussed.
* We also discussed the expectations for the GSoC program.
* The mentors emphasized the importance of communication in open source projects.
* At the end, there was a Q&A session to address any questions we had.

# Personal Meeting With The Mentors
*(May 30, 2025)*

* They emphasized the importance of documentation in this project.
* I was encouraged to post regular updates.
* We discussed the projects and what the targets and expectations are.
* We also discussed timings for the weekly technical calls but didn't make a final decision, since one of the mentors wasn't available on the call.
* There was also a discussion with my mentor about reviewing last year's work.
* We also discussed adding my documentation to the FOSSology GSoC page and submitting a pull request.
* Lastly, I engaged with the mentors on how to start my coding period by installing FOSSology locally and trying out different tests to understand how it works.


### Engagements

* Explored the FOSSology local setup and installation process.
* Worked through some crucial pipeline requirements essential for Safaa's automation efforts.


**This report summarizes my activities and interactions during the GSoC community bonding period.**
32 changes: 32 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-04.md
@@ -0,0 +1,32 @@
---
title: Week 1
author: Abdulsobur Oyewale
tags: [gsoc25, Data Pipeline for Safaa]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 1
*(June 4, 2025)*

## Attendees:
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)

### Engagements
* I worked on installing FOSSology locally and overcame the obstacle of working on Windows: since the FOSSology installation guide works best with Linux, I completed the installation with WSL2.
* I also ran various examples against the Safaa agent to test its features and functionality, which gave me insight into how it currently works. You can find this here.

## Discussion:
* I talked about how I installed FOSSology using the link provided by my mentors and familiarized myself with its features.
* Furthermore, I discussed the tests I conducted with Safaa's current copyright detection agent, and how I then experimented with the false-positive deactivation agent to assess its features and functionality by playing around with examples.
* Lastly, Safaa's performance was critically evaluated, and strategies for acquiring data for my copyright script were discussed with me.


## Subsequent Steps
* I was tasked with beginning the first task on the project list: creating a script to fetch copyright data from a FOSSology instance.
4 changes: 4 additions & 0 deletions docs/2025/data-pipeline/updates/_category_.json
@@ -0,0 +1,4 @@
{
"label": "Weekly Updates",
"position": 2
}