DSIR large-scale data selection framework for language model training
Updated Apr 7, 2024 · Python
GUNDAM is a data management system that prioritizes data using language models.
Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
[ACL 2025 main] SCAR: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning of Large Language Models
Framework for processing and filtering datasets
This repository contains all (Python 3) code and libraries required for the 2022-2023 Notre Dame Rocketry Team (NDRT) Apogee Control System (ACS). It also contains sensor/actuator example code and flight data.
Base-call error-filtering and read-preprocessing pipeline for FASTQ libraries
Anonymises data in text files and spreadsheet files. It recognises and removes various kinds of personally identifiable information (PII). Each removed part is replaced with suitable generic text, depending on the type of data removed. English and Russian are currently supported. Russian works with both Cyrillic and Latin characters.
A powerful tool that allows users to query JSON data using SQL-like syntax. Effortlessly search, filter, and manipulate your JSON data with familiar SQL queries.
A multi-parameter sequential search utility for filtering an input Excel datasheet.
A powerful and flexible data filtering library with unified interface for multiple data sources including Peewee ORM, Pydantic models, and Python iterables. Flask-friendly.
🤖Ngram Similarity Engine📚
A powerful, interactive desktop dashboard built with PyQt5, Matplotlib, Seaborn, Plotly, and scikit-learn. Designed for data wrangling, visualization, and machine learning—all in one elegant dark-themed GUI.
This Python script filters out incorrectly formatted lines in the `lottery_numbers.csv` file and saves only the valid ones in `correct_numbers.csv`.
Powerful terminal-based tool to analyze startup funding data — filter, sort, view insights, and export results using Python and pandas.
Drawer automates single-elimination draw systems, ensuring fairness with balanced group allocation and bias-free brackets. Now enhanced with Docker, it eliminates dependency issues for seamless event management.
NASA Asteroid Data Analysis
Data exploration project introduced by Udacity Data Analysis Nanodegree
Details the data modeling techniques used, the functionality of the output, and an in-depth look at how a plan finder works based on user inputs.