Centralized dataset management library for AI evaluation research in Elixir.
CrucibleDatasets provides a unified interface for loading, caching, evaluating, and sampling benchmark datasets (MMLU, HumanEval, GSM8K) with support for versioning, reproducible evaluation, and custom datasets.
- Unified Dataset Interface: Single API for all benchmark types
- Automatic Caching: Fast access with local caching and version tracking
- Comprehensive Metrics: Exact match, F1 score, and custom evaluation metrics
- Dataset Sampling: Random, stratified, and k-fold cross-validation
- Reproducibility: Deterministic sampling with seeds, version tracking
- Extensible: Easy integration of custom datasets and metrics
- MMLU (Massive Multitask Language Understanding) - 57 subjects across STEM, humanities, social sciences
- HumanEval - Code generation benchmark with 164 programming problems
- GSM8K - Grade school math word problems (8,500 problems)
- Custom Datasets - Load from local JSONL files
Add :crucible_datasets to your list of dependencies in mix.exs:
def deps do
  [
    {:crucible_datasets, "~> 0.1.0"}
  ]
end
Or install from GitHub:
def deps do
  [
    {:crucible_datasets, github: "nshkrdotcom/elixir_ai_research", sparse: "apps/dataset_manager"}
  ]
end
# Load a dataset
{:ok, dataset} = CrucibleDatasets.load(:mmlu_stem, sample_size: 100)
# Create predictions (example with perfect predictions)
predictions = Enum.map(dataset.items, fn item ->
  %{
    id: item.id,
    predicted: item.expected,
    metadata: %{latency_ms: 100}
  }
end)

# Evaluate
{:ok, results} = CrucibleDatasets.evaluate(predictions,
  dataset: dataset,
  metrics: [:exact_match, :f1],
  model_name: "my_model"
)
IO.puts("Accuracy: #{results.accuracy * 100}%")
# => Accuracy: 100.0%
# Load MMLU STEM subset
{:ok, mmlu} = CrucibleDatasets.load(:mmlu_stem, sample_size: 200)
# Load HumanEval
{:ok, humaneval} = CrucibleDatasets.load(:humaneval)
# Load GSM8K
{:ok, gsm8k} = CrucibleDatasets.load(:gsm8k, sample_size: 150)
# Load custom dataset from file
{:ok, custom} = CrucibleDatasets.load("my_dataset",
  source: "path/to/data.jsonl"
)
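The expected on-disk layout is assumed here to mirror the unified item schema described below (id, input, expected, metadata), one JSON object per line; check the loader if your fields differ:

{"id": "my_dataset_0", "input": {"question": "What is 2 + 2?"}, "expected": "4", "metadata": {"subject": "arithmetic"}}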
# Single model evaluation
{:ok, results} = CrucibleDatasets.evaluate(predictions,
  dataset: :mmlu_stem,
  metrics: [:exact_match, :f1],
  model_name: "gpt4"
)

# Batch evaluation (compare multiple models)
model_predictions = [
  {"model_a", predictions_a},
  {"model_b", predictions_b},
  {"model_c", predictions_c}
]

{:ok, all_results} = CrucibleDatasets.evaluate_batch(model_predictions,
  dataset: :mmlu_stem,
  metrics: [:exact_match, :f1]
)
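If you need per-model control over options, the same comparison can be assembled from the single-model call. A minimal sketch using only the API shown above:

# Evaluate each model separately and print its accuracy
for {model_name, predictions} <- model_predictions do
  {:ok, results} =
    CrucibleDatasets.evaluate(predictions,
      dataset: :mmlu_stem,
      metrics: [:exact_match, :f1],
      model_name: model_name
    )

  IO.puts("#{model_name}: #{Float.round(results.accuracy * 100, 1)}%")
end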
# Random sampling
{:ok, sample} = CrucibleDatasets.random_sample(dataset,
  size: 50,
  seed: 42
)

# Stratified sampling (maintain subject distribution)
{:ok, stratified} = CrucibleDatasets.stratified_sample(dataset,
  size: 100,
  strata_field: [:metadata, :subject]
)

# Train/test split
{:ok, {train, test}} = CrucibleDatasets.train_test_split(dataset,
  test_size: 0.2,
  shuffle: true
)
# K-fold cross-validation
{:ok, folds} = CrucibleDatasets.k_fold(dataset, k: 5)
Enum.each(folds, fn {train, test} ->
  # Train and evaluate on each fold
end)
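Each fold can be scored with the same evaluate/2 call used above. The sketch below reuses the perfect-prediction pattern from the quick start; in practice you would generate predictions from your model for test.items:

fold_accuracies =
  Enum.map(folds, fn {_train, test} ->
    # Stand-in predictions; replace with your model's outputs
    predictions =
      Enum.map(test.items, fn item ->
        %{id: item.id, predicted: item.expected, metadata: %{latency_ms: 0}}
      end)

    {:ok, results} =
      CrucibleDatasets.evaluate(predictions,
        dataset: test,
        metrics: [:exact_match],
        model_name: "my_model"
      )

    results.accuracy
  end)

mean_accuracy = Enum.sum(fold_accuracies) / length(fold_accuracies)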
# List cached datasets
cached = CrucibleDatasets.list_cached()
# Invalidate specific cache
CrucibleDatasets.invalidate_cache(:mmlu_stem)
# Clear all cache
CrucibleDatasets.clear_cache()
All datasets follow a unified schema:
%CrucibleDatasets.Dataset{
  name: "mmlu_stem",
  version: "1.0",
  items: [
    %{
      id: "mmlu_stem_physics_0",
      input: %{
        question: "What is the speed of light?",
        choices: ["3×10⁸ m/s", "3×10⁶ m/s", "3×10⁵ m/s", "3×10⁷ m/s"]
      },
      expected: 0,  # Index of correct answer
      metadata: %{
        subject: "physics",
        difficulty: "medium"
      }
    },
    # ... more items
  ],
  metadata: %{
    source: "huggingface:cais/mmlu",
    license: "MIT",
    domain: "STEM",
    total_items: 200,
    loaded_at: ~U[2024-01-15 10:30:00Z],
    checksum: "abc123..."
  }
}
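Because every dataset uses this shape, items can be inspected with ordinary Enum calls. Field names below are taken from the schema above; metadata keys vary by dataset:

# Count items per subject, then keep only the physics questions
by_subject = Enum.frequencies_by(dataset.items, & &1.metadata.subject)
physics = Enum.filter(dataset.items, &(&1.metadata.subject == "physics"))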
Binary metric (1.0 or 0.0) with normalization:
- Case-insensitive string comparison
- Whitespace normalization
- Numerical comparison with tolerance
- Type coercion (string ↔ number)
CrucibleDatasets.Evaluator.ExactMatch.compute("Paris", "paris")
# => 1.0
CrucibleDatasets.Evaluator.ExactMatch.compute(42, "42")
# => 1.0
Token-level F1 (precision and recall):
CrucibleDatasets.Evaluator.F1.compute(
  "The quick brown fox",
  "The fast brown fox"
)
# => 0.75 (3 of 4 tokens overlap)
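For reference, token-level F1 over whitespace tokens can be computed as below. This is an illustrative sketch, not necessarily the library's exact implementation:

# Token-level F1: harmonic mean of token precision and recall
token_f1 = fn predicted, expected ->
  pred_tokens = predicted |> String.downcase() |> String.split()
  exp_tokens = expected |> String.downcase() |> String.split()
  overlap = Enum.count(pred_tokens, &(&1 in exp_tokens))

  if overlap == 0 do
    0.0
  else
    precision = overlap / length(pred_tokens)
    recall = overlap / length(exp_tokens)
    2 * precision * recall / (precision + recall)
  end
end

token_f1.("The quick brown fox", "The fast brown fox")
# => 0.75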
Define custom metrics as functions:
semantic_similarity = fn predicted, expected ->
  # Your custom metric logic
  0.95
end

{:ok, results} = CrucibleDatasets.evaluate(predictions,
  dataset: dataset,
  metrics: [:exact_match, semantic_similarity]
)
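A custom metric receives the predicted and expected values and returns a score. For example, a numeric-tolerance metric for free-form math answers (using the gsm8k dataset loaded earlier) might look like this; an illustrative sketch, not a built-in metric:

# Treat answers as equal if they parse to numbers within a small tolerance
numeric_match = fn predicted, expected ->
  with {p, _} <- Float.parse(to_string(predicted)),
       {e, _} <- Float.parse(to_string(expected)) do
    if abs(p - e) < 1.0e-6, do: 1.0, else: 0.0
  else
    _ -> 0.0
  end
end

{:ok, results} = CrucibleDatasets.evaluate(predictions,
  dataset: gsm8k,
  metrics: [numeric_match]
)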
Run the included examples:
# Basic usage
mix run examples/basic_usage.exs
# Advanced evaluation workflow
mix run examples/evaluation_workflow.exs
Run the test suite:
cd apps/dataset_manager
mix test
CrucibleDatasets/
├── CrucibleDatasets                    # Main API
├── CrucibleDatasets.Dataset            # Dataset schema
├── CrucibleDatasets.EvaluationResult   # Evaluation result schema
├── CrucibleDatasets.Loader             # Dataset loading
│   ├── Loader.MMLU                     # MMLU loader
│   ├── Loader.HumanEval                # HumanEval loader
│   └── Loader.GSM8K                    # GSM8K loader
├── CrucibleDatasets.Cache              # Local caching
├── CrucibleDatasets.Evaluator          # Evaluation engine
│   ├── Evaluator.ExactMatch            # Exact match metric
│   └── Evaluator.F1                    # F1 score metric
└── CrucibleDatasets.Sampler            # Sampling utilities
Datasets are cached in: ~/.elixir_ai_research/datasets/
datasets/
├── manifest.json          # Index of all cached datasets
├── mmlu_stem/
│   └── 1.0/
│       ├── data.etf       # Serialized dataset
│       └── metadata.json  # Version info
├── humaneval/
└── gsm8k/
CrucibleDatasets integrates with other research infrastructure libraries:
- Ensemble: Evaluate ensemble model predictions
- Bench: Statistical comparison of model performance
- ReqLLM: Generate predictions from LLM models
- Add new dataset loaders in lib/dataset_manager/loader/ (see the sketch below)
- Implement custom metrics in lib/dataset_manager/evaluator/
- Add tests in test/
- Update documentation
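A hypothetical starting point for a new loader is sketched below. The module name, the load/1 signature, and the idea that a loader simply returns {:ok, %CrucibleDatasets.Dataset{}} in the unified schema are assumptions for illustration; check the existing loaders for the real contract:

defmodule CrucibleDatasets.Loader.MyBenchmark do
  # Hypothetical loader sketch; mirror the existing loaders' structure in practice
  alias CrucibleDatasets.Dataset

  def load(_opts \\ []) do
    items = [
      %{
        id: "my_benchmark_0",
        input: %{question: "2 + 2 = ?"},
        expected: "4",
        metadata: %{subject: "arithmetic"}
      }
    ]

    {:ok,
     %Dataset{
       name: "my_benchmark",
       version: "1.0",
       items: items,
       metadata: %{total_items: length(items), loaded_at: DateTime.utc_now()}
     }}
  end
end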
MIT License - see LICENSE file for details