📰 Newspaper4k, a fork of the beloved Newspaper3k. Extracts articles, titles, and metadata from news websites.
Updated Feb 21, 2026 - Python
[Pedestron] Generalizable Pedestrian Detection: The Elephant In The Room. @ CVPR2021
Data Labeling, Tracking and Annotation with AI
Curate, evaluate, and ship LLM datasets from any document.
LoRA Pilot is the ultimate Docker image for all Stable Diffusion LoRA trainers. It includes kohya_ss, diffusion pipes, and TensorBoard for training, plus ComfyUI and InvokeAI for validation. Features shared models, modules, custom integrations, and automation scripts.
Anonymize sensitive data in your datasets.
(Windows/Linux) Local WebUI for fine-tuning, evaluation, and generation with neural network models (LLM and Stable Diffusion), written in Python with a Gradio interface. Translated into 3 languages.
MALVADA: Malware Execution Traces Dataset generation.
Make a custom dataset in the AVA dataset format.
Utilities for working with the Common Voice dataset
Automatic image downloader for Pokemon cards
Low-resource context relation sampler: samples contexts with relations for fact-checking and fine-tuning your LLMs, powered by AREkit
This project contains various scripts that can assist in the process of preparing datasets.
NanoListener: a small suite of Python scripts to create custom training datasets for modification-aware basecaller models.
A GUI application to tag images and edit them using an editor.
Utility for making datasets of images and the coordinates of points marked on those images by the user
Katachi is a Python framework for validating and processing hierarchical directory structures using YAML-based schemas. It ensures your folders and files follow expected shapes, naming rules, and relationships—before any processing begins. Use it to enforce structure, catch issues early, and keep your data pipelines reliable.
While working on a U-Net project, I created a program that adds noise, a random grid (textbook), and a random shade of grey to images. Depending on the variation, the tool outputs combinations of two images: for the first variation, the noisy image itself and the clean one (this variation gave better results in the U-Net application), while the …
Conversations / Instructions Editor
Atomic Dataset Generator for training ML potentials