Data Pipeline for Safaa weekly update/report. #308

48 changes: 47 additions & 1 deletion docs/2025/data-pipeline/index.md
@@ -2,4 +2,50 @@
sidebar_position: 2
title: Introduction
slug: /2025/data-pipeline/
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

## Author

[Abdulsobur Oyewale](https://github.com/smilingprogrammer)

## Contact info

- [Email](mailto:oyewaleabdulsobur@gmail.com)

## Project title

Data Pipelining For Safaa

## What's the project about?

Currently, Safaa provides a strong framework for dealing with copyright notices, focusing in particular on identifying and reducing false positives and on streamlining the decluttering procedure that removes unnecessary content. Key features of Safaa include:
1. Model Flexibility
2. Integration with scikit-learn
3. spaCy Integration
4. Preprocessing Tools
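As a rough illustration of what the preprocessing step might involve, a minimal cleaning function for raw copyright text could look like the following. The specific rules here (stripping stray markup, non-printable bytes, and redundant whitespace) are assumptions for illustration, not Safaa's actual implementation:

```python
# Illustrative sketch of cleaning a raw copyright notice before it is
# fed to a model. The rules below are assumptions, not Safaa's code.
import re

def clean_notice(text: str) -> str:
    """Remove common scan debris from a raw copyright notice."""
    text = re.sub(r"<[^>]+>", " ", text)       # stray HTML tags
    text = re.sub(r"[^\x20-\x7e]", " ", text)  # non-printable bytes
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip()

print(clean_notice("Copyright &copy; <b>2020</b>\x00  Example  Corp."))
```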

However, data in the Safaa project is currently curated manually, and much of the surrounding workflow is manual as well.
This project will concentrate on creating a pipeline, utilizing LLMs if required to increase accuracy, or applying deep learning techniques for improvement.

Writing scripts to automatically copy copyright data (a group's data or selected users' data) from a FOSSology instance to train the model.
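A first sketch of that fetch step might look like the following. It uses an in-memory SQLite database as a stand-in for the FOSSology PostgreSQL instance, and the table and column names (`copyright`, `content`, `is_enabled`) are assumptions for illustration; the real script would query the actual FOSSology schema, e.g. via psycopg2:

```python
# Sketch of the fetch step: query a "copyright" table and dump rows to
# CSV for training. SQLite stands in for FOSSology's PostgreSQL here,
# and the schema below is an assumption, not FOSSology's real one.
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE copyright (content TEXT, is_enabled INTEGER)")
conn.executemany(
    "INSERT INTO copyright VALUES (?, ?)",
    [
        ("Copyright (c) 2020 Example Corp.", 1),
        ("all rights reserved to nobody", 0),
    ],
)

rows = conn.execute("SELECT content, is_enabled FROM copyright").fetchall()

with open("copyrights.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    writer.writerows(rows)

print(f"exported {len(rows)} rows")
```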


## What should be done?

Here are the key tasks planned for the project:

1. Create scripts to fetch copyright data from the FOSSology server's copyright table (localhost).
2. Clean and preprocess the fetched copyright data (utilizing prewritten preprocessing functions).
   - Preprocessed data should contain a label and clean text.
3. Split the data into training/validation/test sets.
4. Train the false-positive model as well as the declutter model (utilizing prewritten training functions).
5. Evaluate the models (checking precision, recall, etc.).
6. Version and release the models.
7. The pipeline should work on both GitLab and GitHub.
   - Manual trigger.
   - Should also be able to run as a cron job.
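The splitting, training, and evaluation steps above could be sketched end to end with scikit-learn, which Safaa already integrates with. The tiny inline dataset and the plain TF-IDF plus logistic-regression model are placeholders for the real fetched data and Safaa's prewritten training functions:

```python
# Hedged sketch of split/train/evaluate; the data and model below are
# placeholders, not Safaa's actual training code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

texts = [
    "Copyright (c) 2019 Acme Inc.",
    "Copyright 2020 John Smith",
    "(c) 2018 Widgets Ltd. All rights reserved.",
    "Copyright (C) 2021 Example Project contributors",
    "see the license file for details",
    "this notice may not be removed",
    "permission is hereby granted free of charge",
    "redistribution and use in source and binary forms",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = genuine notice, 0 = false positive

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

vec = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vec.fit_transform(X_train), y_train)

pred = model.predict(vec.transform(X_test))
print("precision:", precision_score(y_test, pred, zero_division=0))
print("recall:", recall_score(y_test, pred, zero_division=0))
```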
42 changes: 42 additions & 0 deletions docs/2025/data-pipeline/updates/2025-05-30.md
@@ -0,0 +1,42 @@
---
title: Community bonding
author: Abdulsobur Oyewale
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->
# **Meeting Summary for GSoC Community Bonding Period**

# Introduction Meeting
*(May 29, 2025)*

This was the inaugural meeting of the community bonding period for GSoC 2025.
* A general introduction of mentors and contributors took place.
* We were given an introduction to the FOSSology community.
* The time and platform for the weekly general meeting were discussed.
* We also discussed the expectations for the GSoC program.
* The mentors emphasized the importance of communication in open source projects.
* At the end, there was a Q&A session to address any questions we had.

# Personal Meeting With The Mentors
*(May 30, 2025)*

* They emphasized the importance of documentation in this project.
* I was encouraged to post regular updates.
* We discussed the projects and what the targets and expectations are.
* We also discussed timings for the weekly technical calls but didn't make a final decision, since one of the mentors wasn't available on the call.
* There was also a discussion with my mentor about reviewing last year's work.
* We also discussed adding my documentation to the FOSSology GSoC page and submitting a pull request.
* Lastly, I engaged with the mentors on how to start my coding period by installing FOSSology locally and trying out different tests to understand how it works.


### Engagements

* Explored the FOSSology local setup and installation process.
* Worked through some crucial pipeline requirements essential for Safaa's automation efforts.


**This report summarizes my activities and interactions during the GSoC community bonding period.**
32 changes: 32 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-04.md
@@ -0,0 +1,32 @@
---
title: Week 1
author: Abdulsobur Oyewale
tags: [gsoc25, Data Pipeline for Safaa]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <oyewaleabdulsobur@gmail.com>
-->

# WEEK 1
*(June 4, 2025)*

## Attendees:
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)

### Engagements
* I worked on installing FOSSology locally and overcame the obstacle of working on Windows: since the FOSSology installation guide works best with Linux, I completed the installation with WSL2.
* I also ran various examples against the Safaa agent to test its features and functionality, which gave me insight into how it currently works. You can find this here.

## Discussion:
* I talked about how I installed FOSSology using the link provided by my mentors and familiarized myself with its features.
* Furthermore, I discussed the tests I conducted with Safaa's current copyright detection agent, and how I then experimented with the false-positive deactivation agent to assess its features and functionality by playing around with examples.
* Lastly, Safaa's performance was critically evaluated, and strategies for acquiring data for my copyright script were discussed with me.


## Subsequent Steps
* I was tasked with beginning the first task on the project list: creating a script to fetch copyright data from a FOSSology instance.
4 changes: 4 additions & 0 deletions docs/2025/data-pipeline/updates/_category_.json
@@ -0,0 +1,4 @@
{
"label": "Weekly Updates",
"position": 2
}