A comprehensive Python library for unified text dataset management and processing, designed especially for large language models. Load text from JSON/JSONL files or directories, apply filtering and sampling, iterate sequentially or randomly, and produce batches with ease.
- Multi-format Support: JSON, JSONL, and directory parsing
- Smart Text Extraction: Built-in ShareGPT format support
- Advanced Filtering: Multi-dimensional length filtering (words/chars/tokens)
- Flexible Sampling: Random, sequential, and specified index sampling
- Multiple Iteration Modes: Sequential and random batch processing
- Statistics & Analytics: Built-in dataset statistics
- Highly Configurable: Comprehensive configuration system
- Performance Optimized: Efficient processing with optional caching
- `dataset.py`: Core `TextDataset` class for indexing, filtering, shuffling, sampling, iteration, and batch output
- `io.py`: Data source traversal and JSON/JSONL parsing utilities
- `extractors.py`: Text extraction from raw entries, ShareGPT format support
- `filters.py`: Length-based and custom predicate filtering
- `samplers.py`: Random, sequential, and specified index samplers
- `types.py`: Configuration data classes (`FilterConfig`, `SampleConfig`)
- `extract_method.py`: Extension points for extraction methods
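For orientation, and assuming these modules live under a `dataset_management/` package directory (consistent with the import paths used in the examples below), the layout is roughly:

```
dataset_management/
├── dataset.py
├── io.py
├── extractors.py
├── filters.py
├── samplers.py
├── types.py
└── extract_method.py
```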
Installation:

```bash
pip install -r requirements.txt
```

Quick Start:

```python
from dataset_management import TextDataset, FilterConfig
import re

# Create dataset and build index (automatically parses JSON/JSONL files in the directory)
ds = TextDataset(source='data/', is_sharegpt=True, seed=42)
ds.build_index()

# Filter: 30-50 characters, must contain Chinese characters
def has_chinese(t: str) -> bool:
    return re.search(r'[\u4e00-\u9fff]', t) is not None

ds.filter(FilterConfig(min_len=30, max_len=50, unit='chars', predicate=has_chinese))

# Shuffle and get texts
ds.shuffle()
texts = ds.texts
```

FilterConfig Parameters:
- `min_len` / `max_len`: Length range (optional)
- `unit`: `'words' | 'chars' | 'tokens'` (count by words, characters, or tokens)
- `predicate`: Custom boolean function; texts returning `True` are kept
Examples:

```python
# Filter by word count
ds.filter(FilterConfig(min_len=10, max_len=100, unit='words'))

# Filter by character count
ds.filter(FilterConfig(min_len=50, max_len=500, unit='chars'))

# Custom predicate filtering
def is_english(t: str) -> bool:
    return re.match(r'^[a-zA-Z\s]+$', t) is not None

ds.filter(FilterConfig(predicate=is_english))
```

SampleConfig Parameters:
- `n`: Number of samples to select
- `mode`: `'random' | 'sequential' | 'specified'`
- Other: `seed`, `replace`, `indices`
Examples:

```python
from dataset_management.types import SampleConfig

# Randomly sample 100 texts
samples = ds.select(SampleConfig(n=100, mode='random', seed=123))

# Sequentially sample all texts
samples = ds.select(SampleConfig(mode='sequential'))

# Sample specific indices
samples = ds.select(SampleConfig(n=50, mode='specified', indices=[0, 2, 5, 10]))
```

```python
# Get dataset statistics
stats = ds.stats()
print(f"Count: {stats['count']}")
print(f"Min words: {stats['min_words']}")
print(f"Max words: {stats['max_words']}")
print(f"P50 words: {stats['p50_words']}")
print(f"P90 words: {stats['p90_words']}")
```
```python
# Simple iteration
for text in ds.iter(mode='sequential'):
    print(text)

# Cyclical iteration (loops forever)
for text in ds.iter(mode='sequential', cycle=True):
    print(text)  # Will continue indefinitely
```

```python
# Sequential batches
for batch in ds.get_batch(batch_size=32, mode='sequential', drop_last=False):
    # Process batch (List[str])
    process_batch(batch)

# Random batches (shuffled)
for batch in ds.get_batch(batch_size=64, mode='random', drop_last=True):
    # Process batch (List[str])
    process_batch(batch)
```
"text": "Hello, world!",
"metadata": {
"source": "dataset1"
}
}{"text": "First document"}
{"text": "Second document"}
{"text": "Third document"}{
"conversation": [
{
"human": "What is the capital of France?",
"gpt": "The capital of France is Paris."
}
]
}from dataset_management import TextDataset, FilterConfig, SampleConfig
Complete example:

```python
from dataset_management import TextDataset, FilterConfig, SampleConfig

# Load ShareGPT format chat data
ds = TextDataset(source='chat_data/', is_sharegpt=True, seed=42)
ds.build_index()
print(f"Original dataset size: {len(ds.texts)}")

# Filter out very short or very long conversations
ds.filter(FilterConfig(
    min_len=20,    # At least 20 characters
    max_len=1000,  # Max 1000 characters
    unit='chars'
))

# Filter for conversations containing questions
def has_question(t: str) -> bool:
    return '?' in t

ds.filter(FilterConfig(predicate=has_question))

# Sample 1000 high-quality conversations
samples = ds.select(SampleConfig(n=1000, mode='random', seed=123))
print(f"Filtered dataset size: {len(samples)}")

# Process in batches for training
for batch in ds.get_batch(batch_size=64, mode='random'):
    # Batch contains filtered and shuffled texts
    train_model(batch)
```
```python
from transformers import AutoTokenizer

# Load tokenizer for token-based filtering
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
ds = TextDataset(source='data/', tokenizer=tokenizer, seed=42)
ds.build_index()

# Filter by token count (e.g., for language model training)
ds.filter(FilterConfig(min_len=128, max_len=512, unit='tokens'))
```

`TextDataset`: Main class for dataset management.
Constructor Parameters:
- `source`: Data source path (single file or directory)
- `is_sharegpt`: Whether to parse in ShareGPT format
- `tokenizer`: Optional tokenizer for token-based filtering
- `cache`: Whether to cache parsed results (future extension)
- `seed`: Random seed for reproducible shuffling and sampling
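As a quick reference, here is a minimal construction sketch using the parameters above; exact defaults should be checked against the source, and the tokenizer choice (`bert-base-uncased`) is illustrative, not prescribed by the library:

```python
from transformers import AutoTokenizer
from dataset_management import TextDataset

# Example tokenizer; only needed when filtering with unit='tokens'
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

ds = TextDataset(
    source='data/',       # single file or directory of .json/.jsonl files
    is_sharegpt=False,    # set True for ShareGPT-style conversation data
    tokenizer=tokenizer,  # optional; enables token-based filtering
    cache=False,          # reserved for caching parsed results (future extension)
    seed=42,              # reproducible shuffling and sampling
)
ds.build_index()
```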
Key Methods:
- `build_index(drop_empty=True)`: Parse and build the internal text index
- `filter(cfg: FilterConfig)`: Filter by length range and custom predicates
- `shuffle(seed: Optional[int] = None)`: Shuffle the text list
- `select(cfg: SampleConfig) -> List[str]`: Sample texts according to configuration
- `iter(mode='sequential', cycle=False)`: Sequential iteration
- `get_batch(batch_size: int, mode='sequential'|'random', drop_last=False)`: Batch iteration
- `stats() -> dict`: Get dataset statistics
`FilterConfig`: Configuration for text filtering.
Parameters:
- `min_len`: Minimum length threshold
- `max_len`: Maximum length threshold
- `unit`: Length unit (`'words'`, `'chars'`, or `'tokens'`)
- `predicate`: Custom boolean function for filtering
`SampleConfig`: Configuration for text sampling.
Parameters:
- `n`: Number of texts to sample
- `mode`: Sampling mode (`'random'`, `'sequential'`, or `'specified'`)
- `indices`: Specific indices for `'specified'` mode
- `replace`: Whether to sample with replacement
- `seed`: Random seed for reproducible sampling
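`replace` is not demonstrated elsewhere in this README; the sketch below assumes `replace=True` enables sampling with replacement, so `n` may exceed the number of available texts:

```python
from dataset_management.types import SampleConfig

# Oversample with replacement (assumption: replace=True permits duplicates,
# so n may be larger than the filtered dataset size)
oversampled = ds.select(SampleConfig(n=5000, mode='random', replace=True, seed=7))
```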
- Data source: `source` can be a single file or a directory; all `.json`/`.jsonl` files in the directory will be parsed (see the example layout below)
- When `is_sharegpt=True`, texts are extracted according to the ShareGPT format specification
- For token-based filtering, a usable `tokenizer` must be provided to `TextDataset`
- `build_index()` performs basic empty-text removal and random shuffling (affected by `seed`)
- The library maintains text order consistency during filtering and sampling operations
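For illustration, a directory passed as `source` might look like this (file names are hypothetical):

```
data/
├── corpus_a.json     # single JSON object (see data formats above)
├── corpus_b.jsonl    # one JSON object per line
└── notes.txt         # ignored: only .json/.jsonl files are parsed
```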
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
For detailed Chinese documentation, see README_zh.md