Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools.
Croissant builds on schema.org and its Dataset vocabulary, a widely used format for representing datasets on the Web and making them searchable.
Croissant is currently under development by the community.
Datasets are the source code of machine learning (ML), but working with ML datasets is needlessly hard because each dataset has a unique file organization and method for translating file contents into data structures and thus requires a novel approach to using the data. We need a standard dataset format to make it easier to find and use ML datasets and especially to develop tools for creating, understanding, and improving ML datasets.
Croissant 🥐 is a high-level format for machine learning datasets. Croissant brings together four rich layers (in a tasty manner, we hope 😉):
- Metadata: description of the dataset, including responsible ML aspects
- Resources: one or more files or other sources containing the raw data
- Structure: how the raw data is combined and arranged into data structures for use
- ML semantics: how the data is most often used in an ML context
Here is an extremely simple example of the Croissant format. The top-level fields carry the metadata, distribution lists the resources, and recordSet describes the structure:
{
  "@type": "sc:Dataset",
  "name": "minimal_example_with_recommended_fields",
  "description": "This is a minimal example, including the required and the recommended fields.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://example.com/dataset/recipes/minimal-recommended",
  "distribution": [
    {
      "@type": "sc:FileObject",
      "name": "minimal.csv",
      "contentUrl": "data/minimal.csv",
      "encodingFormat": "text/csv",
      "sha256": "48a7c257f3c90b2a3e529ddd2cca8f4f1bd8e49ed244ef53927649504ac55354"
    }
  ],
  "recordSet": [
    {
      "@type": "ml:RecordSet",
      "name": "examples",
      "description": "Records extracted from the example table, with their schema.",
      "field": [
        {
          "@type": "ml:Field",
          "name": "name",
          "description": "The first column contains the name.",
          "dataType": "sc:Text",
          "references": {
            "distribution": "minimal.csv",
            "extract": {
              "column": "name"
            }
          }
        },
        {
          "@type": "ml:Field",
          "name": "age",
          "description": "The second column contains the age.",
          "dataType": "sc:Integer",
          "references": {
            "distribution": "minimal.csv",
            "extract": {
              "column": "age"
            }
          }
        }
      ]
    }
  ]
}
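To make the layering concrete, here is a minimal sketch that reads a Croissant file with nothing but Python's standard json module and pulls out the metadata, the resources, and the record structure. It is an illustration only, not the project's own tooling: the summarize helper is hypothetical, the embedded JSON is an abbreviated copy of the example above, and the code assumes every field carries the references / extract / column shape shown there.

```python
import json

# Abbreviated copy of the example above (descriptions, url, and checksum omitted).
CROISSANT_JSON = """
{
  "@type": "sc:Dataset",
  "name": "minimal_example_with_recommended_fields",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {
      "@type": "sc:FileObject",
      "name": "minimal.csv",
      "contentUrl": "data/minimal.csv",
      "encodingFormat": "text/csv"
    }
  ],
  "recordSet": [
    {
      "@type": "ml:RecordSet",
      "name": "examples",
      "field": [
        {
          "@type": "ml:Field",
          "name": "name",
          "dataType": "sc:Text",
          "references": {"distribution": "minimal.csv", "extract": {"column": "name"}}
        },
        {
          "@type": "ml:Field",
          "name": "age",
          "dataType": "sc:Integer",
          "references": {"distribution": "minimal.csv", "extract": {"column": "age"}}
        }
      ]
    }
  ]
}
"""


def summarize(doc):
    """Hypothetical helper: collect metadata, resources, and structure from a parsed Croissant dict."""
    # Resources layer: map each file's name to where its content lives.
    resources = {res["name"]: res["contentUrl"] for res in doc.get("distribution", [])}

    # Structure layer: for each record set, list (field name, data type, source file, source column).
    structure = {}
    for record_set in doc.get("recordSet", []):
        columns = []
        for field in record_set.get("field", []):
            ref = field["references"]  # assumes the shape used in the example above
            columns.append(
                (field["name"], field["dataType"], ref["distribution"], ref["extract"]["column"])
            )
        structure[record_set["name"]] = columns

    # Metadata layer: a couple of top-level descriptive fields.
    return {
        "name": doc["name"],
        "license": doc.get("license"),
        "resources": resources,
        "structure": structure,
    }


summary = summarize(json.loads(CROISSANT_JSON))
for record_set_name, columns in summary["structure"].items():
    for field_name, data_type, source_file, column in columns:
        print(f"{record_set_name}.{field_name}: {data_type} <- {source_file}[{column}]")
```

Each field points back to a named entry in distribution, which is how the structure layer stays separate from the files that hold the raw data.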
- GitHub Repo
- Specification
- Examples
- Verifier
- Shared Drive
- Requirements Document
- Responsible AI Approach
- Join the mailing list
- Attend Croissant meetings (please join the list to automatically receive the invite)
- File issues for bugs or feature requests
- Contribute code (please sign the MLCommons Association CLA first!)
- Datasets Search crawls and indexes Croissant JSON-LD files on the web and provides a filter to restrict results to Croissant datasets.
- Kaggle embeds Croissant JSON-LD directly in their HTML, and also provides the following ways to download the Croissant JSON-LD file:
  - Via an "Export metadata as Croissant" button on the dataset's page (ex: https://www.kaggle.com/datasets/unsdsn/world-happiness)
  - Via download URL (ex: https://www.kaggle.com/datasets/unsdsn/world-happiness/croissant/download)
- OpenML offers a "Croissant" button on all of their datasets to download the underlying Croissant JSON-LD file.
- Hugging Face offers an API endpoint to build a Croissant JSON-LD.
- TFDS has a CroissantBuilder to transform any JSON-LD file into a TFDS dataset, which makes it possible to load the data into TensorFlow, JAX and PyTorch.
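As a rough illustration of the Kaggle download route listed above, the sketch below builds a download URL and, optionally, fetches and parses the JSON-LD using only the Python standard library. The two helper names are hypothetical, and extending the single world-happiness URL to an owner/dataset pattern is an assumption on our part rather than documented Kaggle behavior. The fetch function is defined but not called here, since it needs network access.

```python
import json
import urllib.request

KAGGLE_DATASETS_BASE = "https://www.kaggle.com/datasets"


def kaggle_croissant_url(owner, dataset):
    """Hypothetical helper: build a Croissant download URL.

    Assumes other datasets follow the same pattern as the world-happiness example.
    """
    return f"{KAGGLE_DATASETS_BASE}/{owner}/{dataset}/croissant/download"


def fetch_croissant(url, timeout=30.0):
    """Hypothetical helper: download a Croissant JSON-LD file and parse it into a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = kaggle_croissant_url("unsdsn", "world-happiness")
print(url)

# Requires network access, so it is left commented out:
# metadata = fetch_croissant(url)
# print(metadata.get("name"))
```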
Croissant project code and examples are licensed under the Apache License 2.0.
Croissant is being developed by the community as a Task Force of the MLCommons Association Datasets Working Group. The Task Force is open to anyone (as is the parent Datasets working group). The Task Force is co-chaired by Omar Benjelloun and Elena Simperl.
Albert Villanova (Hugging Face), Andrew Zaldivar (Google), Baishan Guo (Meta), Carole-Jean Wu (Meta), Ce Zhang (ETH Zurich), Costanza Conforti (Google), D. Sculley (Kaggle), Dan Brickley (Schema.Org), Eduardo Arino de la Rubia (Meta), Edward Lockhart (DeepMind), Elena Simperl (King's College London), Goeff Thomas (Kaggle), Joaquin Vanschoren (TU/Eindhoven, OpenML), Jos van der Velde (TU/Eindhoven, OpenML), Julien Chaumond (Hugging Face), Kurt Bollacker (MLCommons), Lora Aroyo (Google), Luis Oala (Dotphoton), Meg Risdal (Kaggle), Natasha Noy (Google), Newsha Ardalani (Meta), Omar Benjelloun (Google), Peter Mattson (MLCommons), Pierre Marcenac (Google), Pierre Ruyssen (Google), Pieter Gijsbers (TU/Eindhoven, OpenML), Prabhant Singh (TU/Eindhoven, OpenML), Quentin Lhoest (Hugging Face), Steffen Vogler (Bayer), Taniya Das (TU/Eindhoven, OpenML)
Thank you for supporting Croissant! 🙂