Scoping doc catalog#65

Closed
DougManuel wants to merge 4 commits into scoping-doc from scoping-doc-catalog

Conversation

@DougManuel
Contributor

This document proposes the metadata used to describe the database.

On rereading, I am not sure "catalog" is the best name, and I have a few other comments that I'll add during review.

@DougManuel DougManuel requested a review from yulric April 30, 2025 14:04
@DougManuel
Contributor Author

The high-level scope for this PR is now in the main scope document; see PR #50.

We'll close this PR and can bring back parts of it when we write the specifications. Examples of the key specifications are below.

Use cases

Primary use cases

  1. Metadata management
    • Enable a structured and extensible format for documenting dataset-level metadata fields, such as title, description, creator, and license.
    • Ensure metadata is easily accessible and modifiable.
  2. Print and summary
    • Provide intuitive functions for displaying dataset metadata alongside data summaries to give users a comprehensive overview.
  3. Data dictionary generation
    • Support workflows to combine metadata from catalog, variables, and variable_details into a comprehensive data dictionary for documentation and sharing.
  4. Interoperability
    • Use CSV as the primary format for importing and exporting metadata.
    • Allow future consideration for supporting standards like DDI and PMML.
  5. Alignment with sidecars and attributes
    • Integrate seamlessly with workflows by attaching metadata to datasets via attributes, similar to existing recodeflow metadata structures.
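
As a rough illustration of the CSV-based interoperability use case above, the following R sketch round-trips dataset-level metadata through a CSV file. The two-column `field`/`value` layout and the use of a temporary file are assumptions made for illustration; this PR does not specify a CSV layout.

```r
# Hypothetical long-format layout: one row per metadata field.
catalog_df <- data.frame(
  field = c("title", "creator", "license"),
  value = c("Health Survey 2024", "RecodeFlow Team", "CC-BY 4.0"),
  stringsAsFactors = FALSE
)

# Export to CSV (a temporary file keeps the sketch self-contained).
path <- tempfile(fileext = ".csv")
write.csv(catalog_df, path, row.names = FALSE)

# Import and convert to a named list for convenient field access.
imported <- read.csv(path, stringsAsFactors = FALSE)
catalog <- as.list(setNames(imported$value, imported$field))

catalog$title
```

A long format like this keeps the file stable when fields are added later, which fits the "structured and extensible" goal, but a wide one-row layout would work as well.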

Naming and design alignment

Why "catalog"?

The term catalog aligns with widely used metadata standards like DCAT (Data Catalog Vocabulary), which employs "Catalog" and "Dataset" as primary constructs. Unlike alternatives such as study (tightly coupled to studies) or dublin_core (too specific), catalog provides a broader and more intuitive framework for managing dataset-level metadata.

Integration with sidecars and attributes

  • The catalog object will function as a "sidecar," storing metadata in a separate structure and attaching it to datasets via attributes.
  • This design ensures metadata is modular, flexible, and consistent with recodeflow’s existing approach.
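
A minimal sketch of the sidecar idea using base R attributes is shown below. The attribute name `"catalog"` and the plain-list representation are assumptions for illustration, not a settled design.

```r
# Example data frame standing in for a real dataset.
survey <- data.frame(id = 1:3, age = c(34, 51, 27))

# Hypothetical catalog: a plain named list of dataset-level metadata.
catalog <- list(title = "Health Survey 2024", license = "CC-BY 4.0")

# Attach the metadata as an attribute; the data columns are unchanged.
attr(survey, "catalog") <- catalog

# Retrieve a field later without touching the data itself.
attr(survey, "catalog")$title
```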

Existing landscape and interoperability considerations

Existing R packages

  1. tm
    • Provides Dublin Core metadata support for text corpora.
    • Lightweight but text-focused and less relevant for tabular datasets.
  2. dataset
    • Implements Dublin Core metadata for structured data objects.
    • Lacks widespread adoption and flexibility.

Metadata standards

  1. Dublin Core
    • A well-established standard for dataset metadata, including fields like title, creator, and description.
  2. DCAT
    • A W3C standard for data catalog metadata, building on Dublin Core.
    • Adds elements like Catalog, Dataset, and Distribution to support web interoperability.

Considerations for tm and dataset

While direct support for tm or dataset is not planned, their relevance should be revisited if workflows expand to include text data or complex metadata sharing. Interoperability with these packages may require further discussion.


Proposed schema for the catalog object

| Field | Description | Example |
| --- | --- | --- |
| title | Name of the dataset/catalog | "Health Survey 2024" |
| description | Detailed description of the dataset | "Survey data on public health metrics." |
| creator | Person or organization responsible for the data | "RecodeFlow Team" |
| publisher | Organization publishing the data | "Public Health Agency" |
| subject | Topics covered by the dataset | "Demography, Health" |
| date_created | When the dataset was created | "2024-01-15" |
| date_modified | Last modification date | "2024-11-29" |
| version | Dataset version | "1.0" |
| license | Licensing information | "CC-BY 4.0" |
| contact_point | Contact for questions about the dataset | "support@example.org" |
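
One way the schema above could be represented in R is a classed list with a small constructor that rejects unknown fields. The constructor name `new_catalog()` and the validation rule are hypothetical; only the field names and example values come from the schema table.

```r
# Field names taken from the proposed schema table.
catalog_fields <- c(
  "title", "description", "creator", "publisher", "subject",
  "date_created", "date_modified", "version", "license", "contact_point"
)

# Hypothetical constructor: builds a list with class "catalog" and
# stops if a supplied field name is not part of the schema.
new_catalog <- function(...) {
  fields <- list(...)
  unknown <- setdiff(names(fields), catalog_fields)
  if (length(unknown) > 0) {
    stop("Unknown catalog field(s): ", paste(unknown, collapse = ", "))
  }
  structure(fields, class = "catalog")
}

health_catalog <- new_catalog(
  title = "Health Survey 2024",
  description = "Survey data on public health metrics.",
  creator = "RecodeFlow Team",
  date_created = "2024-01-15",
  version = "1.0",
  license = "CC-BY 4.0"
)
```

Dates are kept as ISO 8601 strings here to match the examples in the table; converting them with `as.Date()` would be a reasonable alternative.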

Functions for the catalog object

Core functions

  1. Attach metadata to data
    • set_catalog(data, catalog): Attach a catalog object to a data frame as an attribute.
  2. Retrieve metadata
    • get_catalog(data): Retrieve the catalog object from a data frame.
  3. Print and summary
    • print.catalog(x): Display catalog metadata.
    • summary.catalog(x): Summarize metadata for quick inspection.
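
The core functions listed above might look roughly like the following. The function names come from this PR, but the bodies are assumptions: the catalog is taken to be a list with class `"catalog"` stored in an attribute named `"catalog"`, and the print and summary layouts are illustrative only. The summary method uses `object` as its first argument to match the signature of R's `summary()` generic.

```r
# Attach a catalog object to a data frame as an attribute.
set_catalog <- function(data, catalog) {
  stopifnot(is.data.frame(data), inherits(catalog, "catalog"))
  attr(data, "catalog") <- catalog
  data
}

# Retrieve the catalog object (NULL if none is attached).
get_catalog <- function(data) {
  attr(data, "catalog", exact = TRUE)
}

# S3 print method: one "field: value" line per metadata field.
print.catalog <- function(x, ...) {
  cat("<catalog>\n")
  for (field in names(x)) {
    value <- paste(format(x[[field]]), collapse = ", ")
    cat(sprintf("  %s: %s\n", field, value))
  }
  invisible(x)
}

# S3 summary method: reports which fields hold a non-missing value.
summary.catalog <- function(object, ...) {
  filled <- vapply(
    unclass(object),
    function(v) length(v) > 0 && !all(is.na(v)),
    logical(1)
  )
  data.frame(field = names(filled), filled = unname(filled),
             stringsAsFactors = FALSE)
}

# Usage with a small example dataset.
example_catalog <- structure(
  list(title = "Health Survey 2024", license = NA_character_),
  class = "catalog"
)
survey <- set_catalog(data.frame(id = 1:2), example_catalog)
print(get_catalog(survey))
catalog_summary <- summary(get_catalog(survey))
```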

Utility functions

  1. Access or modify fields
    • catalog_field(catalog, field): Access a specific field in the catalog object.
    • set_catalog_field(catalog, field, value): Update a specific field in the catalog object.
  2. Integration into workflows
    • Combine with variables and variable_details for data dictionary generation.
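
The utility functions and the data dictionary step could be sketched as follows. The accessor names come from this PR; their bodies, and the column names used for `variables` and `variable_details` (`variable`, `label`, `category`, `category_label`), are hypothetical placeholders rather than the actual recodeflow sheet layout.

```r
# Access a single field, failing loudly if it is not set.
catalog_field <- function(catalog, field) {
  if (!field %in% names(catalog)) {
    stop("Field not found in catalog: ", field)
  }
  catalog[[field]]
}

# Update a single field and return the modified catalog.
set_catalog_field <- function(catalog, field, value) {
  catalog[[field]] <- value
  catalog
}

catalog <- structure(
  list(title = "Health Survey 2024", version = "1.0"),
  class = "catalog"
)
catalog <- set_catalog_field(catalog, "version", "1.1")

# Hypothetical variable-level metadata tables.
variables <- data.frame(
  variable = c("age", "sex"),
  label = c("Age in years", "Sex"),
  stringsAsFactors = FALSE
)
variable_details <- data.frame(
  variable = c("sex", "sex"),
  category = c("1", "2"),
  category_label = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Left join so variables without category details (such as age) are kept.
dictionary <- merge(variables, variable_details, by = "variable", all.x = TRUE)

# Stamp each row with dataset-level metadata from the catalog.
dictionary$dataset <- catalog_field(catalog, "title")
dictionary$dataset_version <- catalog_field(catalog, "version")
```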

@DougManuel DougManuel closed this Apr 30, 2025
@DougManuel DougManuel deleted the scoping-doc-catalog branch April 30, 2025 16:17
@DougManuel DougManuel mentioned this pull request Apr 30, 2025
DougManuel added a commit that referenced this pull request Jun 22, 2025
- Implements Dublin Core standard with 10 core fields from PRs #65 and #43
- Follows three-file architecture with registry reference for DRY principles
- Includes recodeflow-specific extensions for workflow integration
- Supports metadata file naming conventions and validation rules