Data deduplication engine, supporting optional compression and public key encryption.
Updated Aug 25, 2022 - Rust
Official Repository of "LLM × DATA" Survey Paper
🚢 Data Toolkit for Sailor Language Models
Self-contained C# library for data deduplication using SQLite
Fast and efficient content-defined chunking for data deduplication. Java implementation of FastCDC as library.
Optimal distributed data deduplication and supervised learning pipeline using Apache Spark
A Java project that splits data using hashing techniques and removes duplicate blocks to save cloud storage. This project also uses the CloudSim framework for cloud storage simulation.
General deduping engine for JDBC sources with output to JDBC/csv targets
A pure-JS, content-addressed, copy-on-write virtual filesystem for the browser, featuring: deduplication, filesystem universes (snapshots), events, and optional asynchronous sync.
RepoCapsule is a Python toolkit for turning GitHub, local, and other text/code sources into clean JSONL corpora for LLM pre-training, fine-tuning, or RAG. It provides structure-aware chunking, robust Unicode decoding, pluggable quality/safety screening, and optional dataset card + deduplication support.
PolyDeDupe: Multi-Lingual Data Deduplication
Enterprise-grade SaaS platform for importing, cleaning, and managing large-scale mailing lists with advanced deduplication and enrichment.
This project is a powerful tool for finding and analyzing duplicate files in a specified directory. The program efficiently identifies identical files based on their content, using the SHA-256 hashing algorithm. It supports configurable parameters, such as the minimum file size to check and ignoring certain
A calculator for storage and transmission of deduplicated data. Output: charts and tables
Fellow is a package for creating people that can be unified by their shared values via a singleton list on the class
Automated business record matching using fuzzy algorithms (RapidFuzz) and browser automation (Playwright)
ETL workflow for stock data processing using Mage and PostgreSQL
The HR Roster Change Detection Pipeline is an automated solution for processing HR roster data. Leveraging Apache Airflow and PostgreSQL, it enables seamless data ingestion, deduplication, and change detection, streamlining HR operations.
Fast dataset merging and deduplication tool
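One of the entries above describes detecting duplicate files by hashing their content with SHA-256, with a configurable minimum file size. As a rough illustration of that general idea (not the code of any listed project), here is a minimal Python sketch. The function names `file_sha256` and `find_duplicates` and the `min_size` parameter are hypothetical, and grouping by file size before hashing is an added assumption to avoid hashing files that cannot match.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(root: Path, min_size: int = 1) -> list[list[Path]]:
    """Group files under `root` that have identical content.

    `min_size` is a hypothetical parameter: files smaller than this many
    bytes are skipped entirely.
    """
    # Step 1: bucket by size, since files of different sizes cannot be equal.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and not path.is_symlink():
            size = path.stat().st_size
            if size >= min_size:
                by_size[size].append(path)

    # Step 2: hash only files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[file_sha256(path)].append(path)

    # Keep only groups that contain real duplicates.
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `find_duplicates(Path("some/dir"), min_size=1024)` would return a list of groups, each group being the paths that share the same content. Reading files in blocks keeps memory use bounded for large files.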