1 change: 1 addition & 0 deletions docs/.gitignore
@@ -0,0 +1 @@
/.quarto/
130 changes: 130 additions & 0 deletions docs/catalog.qmd
@@ -0,0 +1,130 @@
---
title: Catalog scoping
format:
html:
embed-resources: true
---

# Scope and specifications for the `catalog` object in recodeflow

## Introduction

The `catalog` object is proposed as a structured solution for managing dataset-level metadata in **`recodeflow`**. It follows recodeflow's modular sidecar metadata structures and methods (e.g., `variables` and `variable_details`), and emphasizes interoperability, usability, and extensibility. With its lightweight design, the `catalog` object will support CSV-based workflows, print and summary features, and integration into existing metadata management processes.

------------------------------------------------------------------------

## Use cases

### Primary use cases

1. **Metadata management**
- Enable a structured and extensible format for documenting dataset-level metadata fields, such as title, description, creator, and license.
- Ensure metadata is easily accessible and modifiable.
2. **Print and summary**
- Provide intuitive functions for displaying dataset metadata alongside data summaries to give users a comprehensive overview.
3. **Data dictionary generation**
- Support workflows to combine metadata from `catalog`, `variables`, and `variable_details` into a comprehensive data dictionary for documentation and sharing.
4. **Interoperability**
- Use CSV as the primary format for importing and exporting metadata.
- Allow future consideration for supporting standards like DDI and PMML.
5. **Alignment with sidecars and attributes**
- Integrate seamlessly with workflows by attaching metadata to datasets via attributes, similar to existing `recodeflow` metadata structures.

------------------------------------------------------------------------

## Naming and design alignment

### Why "catalog"?

The term **`catalog`** aligns with widely used metadata standards like DCAT (Data Catalog Vocabulary), which employs "Catalog" and "Dataset" as primary constructs. Unlike alternatives such as **`study`** (tightly coupled to studies) or **`dublin_core`** (too specific), **`catalog`** provides a broader and more intuitive framework for managing dataset-level metadata.

### Integration with sidecars and attributes

- The `catalog` object will function as a "sidecar," storing metadata in a separate structure and attaching it to datasets via attributes.
- This design ensures metadata is modular, flexible, and consistent with `recodeflow`’s existing approach.

------------------------------------------------------------------------

## Existing landscape and interoperability considerations

### Existing R packages

1. **`tm`**
- Provides Dublin Core metadata support for text corpora.
- Lightweight but text-focused and less relevant for tabular datasets.
2. **`dataset`**
- Implements Dublin Core metadata for structured data objects.
- Lacks widespread adoption and flexibility.

### Metadata standards

1. **Dublin Core**
- A well-established standard for dataset metadata, including fields like `title`, `creator`, and `description`.
2. **DCAT**
- A W3C standard for data catalog metadata, building on Dublin Core.
- Adds elements like `Catalog`, `Dataset`, and `Distribution` to support web interoperability.

### Considerations for `tm` and `dataset`

While direct support for **`tm`** or **`dataset`** is not planned, their relevance should be revisited if workflows expand to include text data or complex metadata sharing. Interoperability with these packages may require further discussion.

------------------------------------------------------------------------

## Proposed schema for the `catalog` object

| Field | Description | Example |
|------------------|-----------------------------|-------------------------|
| `title` | Name of the dataset/catalog | "Health Survey 2024" |
| `description` | Detailed description of the dataset | "Survey data on public health metrics." |
| `creator` | Person or organization responsible for the data | "RecodeFlow Team" |
| `publisher` | Organization publishing the data | "Public Health Agency" |
| `subject` | Topics covered by the dataset | "Demography, Health" |
| `date_created` | When the dataset was created | "2024-01-15" |
| `date_modified` | Last modification date | "2024-11-29" |
| `version` | Dataset version | "1.0" |
| `license` | Licensing information | "CC-BY 4.0" |
| `contact_point` | Contact for questions about the dataset | "support\@example.org" |
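
As an illustration of the schema, a catalog could be represented as a simple named list in R. The constructor shown here is hypothetical, for illustration only, and is not part of the proposal:

```{r}
# Hypothetical sketch: a catalog as a named list with class "catalog".
# The constructor name `catalog()` is illustrative, not a proposed function.
catalog <- function(title, description = NA, creator = NA, publisher = NA,
                    subject = NA, date_created = NA, date_modified = NA,
                    version = NA, license = NA, contact_point = NA) {
  structure(
    list(title = title, description = description, creator = creator,
         publisher = publisher, subject = subject,
         date_created = date_created, date_modified = date_modified,
         version = version, license = license, contact_point = contact_point),
    class = "catalog"
  )
}

cat_meta <- catalog(
  title       = "Health Survey 2024",
  description = "Survey data on public health metrics.",
  creator     = "RecodeFlow Team",
  license     = "CC-BY 4.0"
)
cat_meta$title
```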

------------------------------------------------------------------------

## Functions for the catalog object

### Core functions

1. **Attach metadata to data**
- `set_catalog(data, catalog)`: Attach a `catalog` object to a data frame as an attribute.
2. **Retrieve metadata**
- `get_catalog(data)`: Retrieve the `catalog` object from a data frame.
3. **Print and summary**
- `print.catalog(x)`: Display catalog metadata.
- `summary.catalog(x)`: Summarize metadata for quick inspection.
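
A minimal sketch of how these core functions could work, assuming a catalog is a named list attached via an attribute; the real implementations may differ:

```{r}
# Sketch of the proposed core functions, assuming a catalog is a named
# list stored as a data-frame attribute. Implementations are assumptions.
set_catalog <- function(data, catalog) {
  attr(data, "catalog") <- catalog
  data
}

get_catalog <- function(data) {
  attr(data, "catalog")
}

print.catalog <- function(x, ...) {
  # Print each field on its own "field: value" line.
  for (field in names(x)) cat(field, ": ", x[[field]], "\n", sep = "")
  invisible(x)
}

df <- data.frame(id = 1:3)
df <- set_catalog(df, structure(list(title = "Health Survey 2024"),
                                class = "catalog"))
get_catalog(df)$title
```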

### Utility functions

1. **Access or modify fields**
- `catalog_field(catalog, field)`: Access a specific field in the `catalog` object.
- `set_catalog_field(catalog, field, value)`: Update a specific field in the `catalog` object.
2. **Integration into workflows**
- Combine with `variables` and `variable_details` for data dictionary generation.
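
A sketch of the field accessors, using the proposed signatures but with assumed implementations:

```{r}
# Field accessors, assuming a catalog is a named list; the signatures
# follow the proposal but the bodies are illustrative.
catalog_field <- function(catalog, field) {
  catalog[[field]]
}

set_catalog_field <- function(catalog, field, value) {
  catalog[[field]] <- value
  catalog
}

cat_meta <- structure(list(title = "Health Survey 2024", version = "1.0"),
                      class = "catalog")
cat_meta <- set_catalog_field(cat_meta, "version", "1.1")
catalog_field(cat_meta, "version")
```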

------------------------------------------------------------------------

## Interoperability

### CSV as the primary format

- Metadata will be imported and exported using CSV files, consistent with workflows for `variables` and `variable_details`.
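
A sketch of what the CSV round trip could look like; the two-column key/value layout is an assumption, not a settled part of the design:

```{r}
# Round-trip a catalog through CSV using a two-column key/value layout.
# The layout and function names are illustrative assumptions.
write_catalog_csv <- function(catalog, path) {
  write.csv(data.frame(field = names(catalog),
                       value = unlist(catalog, use.names = FALSE)),
            path, row.names = FALSE)
}

read_catalog_csv <- function(path) {
  kv <- read.csv(path, stringsAsFactors = FALSE)
  structure(as.list(setNames(kv$value, kv$field)), class = "catalog")
}

path <- tempfile(fileext = ".csv")
write_catalog_csv(structure(list(title = "Health Survey 2024",
                                 version = "1.0"), class = "catalog"), path)
read_catalog_csv(path)$title
```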

## Future considerations

1. **DDI and PMML support**

- Importing and exporting to XML-based formats like DDI or PMML could be explored for interoperability with broader metadata ecosystems.

2. **Roxygen support (CRAN, `help()`, pkgdown)**

    - Generate or export roxygen `data.R`/Rd description code. Importing is challenging because all metadata within the `@description` tag is unstructured.

3. **Potential discussion for package interoperability**

    - Further consideration could be given to leveraging packages like `tm` or `dataset` to expand compatibility for specific workflows.
138 changes: 138 additions & 0 deletions docs/scope.qmd
@@ -0,0 +1,138 @@
---
title: Scoping
format:
html:
embed-resources: true
---


This document describes the scope of the recodeflow library by explaining the problems it aims to solve and those that are outside its purview.

## Context

One of the first steps in any quantitative study is the selection and creation of the variables that will be used to answer the research question. However, this process is not usually conducted in an open and transparent manner, resulting in studies that are:

1. **Inefficient**: Previous work done by research teams is rarely transferable to new studies, even if they use the same variables.
2. **Non-reproducible**: The lack of transparency makes it impossible for a published study to be reproduced without help from the original authors. If enough time has passed since publication, even the original authors may not remember how the study variables were created.

This issue is especially acute in studies developing predictive algorithms. The lack of an open and transparent process, combined with the ever-increasing complexity of these algorithms, makes it impossible not only to reproduce them but also to score them on new data.

The recodeflow library aims to alleviate these issues within the domain of variable selection/creation.

## Scope

### Variable metadata {#variable-metadata}

All the metadata for variable selection and creation should be encoded in a format that is open and machine actionable, allowing it to be easily published along with the main paper. Including this information with the published paper is the first step in making the study easily reproducible. Examples of such metadata include variable labels, category labels, the logic for the creation of variables, etc. However, this information is rarely included. Studies developing predictive algorithms are especially hard to reproduce without this information because of the large number and complexity of the variables involved.

The library should aim to fix this problem by developing a schema in an open and machine actionable format that enables researchers to encode this metadata. Examples of open and machine actionable formats include CSV, YAML, and JSON. Examples of formats that are open but not machine actionable include plain text files. Microsoft Excel is an example of a format that is machine actionable but not open. Microsoft Word is an example of a format that is neither open nor machine actionable.

### Transformation software

Once the study variables have been selected, the usual next step is the creation of the study dataset in a programming environment such as R or Python. However, this translation into code can be error-prone, for example due to coding mistakes or misunderstandings between the individual who defines the variables and the one who codes them.

The library should aim to fix this problem by developing software that automatically creates a study dataset using the information provided by the investigator in the [above-mentioned schema](#variable-metadata). To that end, the schema should include all the information required by the software to create the variables in the study dataset. In addition, using a machine actionable format is vital.
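
As a toy illustration of metadata-driven variable creation (this is not recodeflow's actual implementation, and the column names are made up), each row of a rules table maps a range of a raw variable to a category of a derived variable:

```{r}
# Toy illustration of metadata-driven recoding, NOT recodeflow's API:
# each rule row maps a numeric range of `from` to a category of `variable`.
rules <- data.frame(
  variable = "age_cat",
  from     = "age",
  low      = c(0, 18, 65),
  high     = c(17, 64, 120),
  category = c("child", "adult", "senior")
)

recode_from_rules <- function(data, rules) {
  out <- rep(NA_character_, nrow(data))
  for (i in seq_len(nrow(rules))) {
    raw <- data[[rules$from[i]]]
    hit <- !is.na(raw) & raw >= rules$low[i] & raw <= rules$high[i]
    out[hit] <- rules$category[i]
  }
  data[[rules$variable[1]]] <- out
  data
}

recode_from_rules(data.frame(age = c(5, 30, 70)), rules)$age_cat
```

Keeping the rules in a table rather than in code is what allows them to be published, reviewed, and reused alongside the paper.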

This software will allow for a clear separation between the processes of defining the study variables and using those definitions to create the study dataset, making it easier for investigators and analysts to work independently of each other.

### Long to wide and wide to long

A dataset can come in one of two formats:

1. **Long format**: An observation can take up multiple rows.
2. **Wide format**: An observation takes up a single row.

There are different reasons for choosing one format over the other, but the wide format is usually easier for humans to read, whereas the long format is usually preferred for analysis by a computer.

For example, consider a dataset that looks at blood pressure readings over time.

The wide data format could look like:

```{csv}
Patient Jan_BP Feb_BP Mar_BP
Alice 120 122 119
Bob 115 118 121
```

whereas the long format version could look like,

```{csv}
Patient Month BP
Alice Jan 120
Alice Feb 122
Alice Mar 119
Bob Jan 115
Bob Feb 118
Bob Mar 121
```

Notice how in the wide format all the observations for a single patient are on one row, whereas in the long format they are split across three rows.

When creating a study dataset, analysts may need to convert the original dataset from one format to the other, a conversion the library should support.
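
The blood-pressure example above can be converted with base R's `reshape()` (tidyr's `pivot_longer()` and `pivot_wider()` are common alternatives):

```{r}
# Convert the wide blood-pressure table shown above to long format using
# base R's reshape(); no packages required.
wide <- data.frame(
  Patient = c("Alice", "Bob"),
  Jan_BP  = c(120, 115),
  Feb_BP  = c(122, 118),
  Mar_BP  = c(119, 121)
)

long <- reshape(
  wide,
  direction = "long",
  varying   = c("Jan_BP", "Feb_BP", "Mar_BP"),
  v.names   = "BP",
  times     = c("Jan", "Feb", "Mar"),
  timevar   = "Month",
  idvar     = "Patient"
)
long[long$Patient == "Alice", c("Month", "BP")]
```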

### Escape hatches

Recognizing that not all transformations can be encoded in the defined data format, perhaps because of their complexity or because they would require a change to the library that has yet to be made, the library should allow users to bypass the normal method of creating a transformation.
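
One possible shape for such an escape hatch is a rule that supplies an arbitrary R function to compute the derived variable. This is entirely hypothetical; none of these names exist in recodeflow:

```{r}
# Hypothetical escape hatch: the user supplies a function of the data
# frame instead of an encoded transformation rule.
apply_custom_rule <- function(data, new_var, fn) {
  data[[new_var]] <- fn(data)
  data
}

df <- data.frame(height_m = c(1.6, 1.8), weight_kg = c(60, 80))
df <- apply_custom_rule(df, "bmi",
                        function(d) d$weight_kg / d$height_m^2)
df$bmi
```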

### Missing values

Almost all data has some missing values, which makes it important for the library to allow its users to represent them in the transformation metadata. In addition, missing values come in different flavours: most commonly non-response, but also a question not being asked, a value not being in a database, etc. The library should therefore allow the user to tag each missing value with its type. Finally, recognizing that the final dataset will most probably be used with other libraries and functions, the library should, to the best of its ability, allow missing values and their types to propagate to other parts of the code.
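
For reference, this is how typed missing values look with `haven`'s tagged NAs, which this document later identifies as one of the attribute-based mechanisms recodeflow uses; the tag letters and their meanings here are our own convention:

```{r}
# haven's tagged NAs: each NA in a double vector carries a one-letter tag.
# The tag meanings (r = refused, s = skipped) are our own convention.
library(haven)

bp <- c(120, tagged_na("r"), tagged_na("s"))
is.na(bp)     # tagged NAs still behave as ordinary NAs
na_tag(bp)    # but the tag survives and can be inspected
```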

### Additional metadata

### Import/Export

> Importing and exporting metadata from/to other sources or adding metadata should be easy.

recodeflow stores metadata in `variables` and `variable_details` as a metadata 'sidecar', with three commonly used metadata elements (variable labels, value labels, and tagged NAs) stored in `attr()`, following the approach of `haven` and `labelled`.

Getting metadata to and from these two locations (sidecar and `attr()`) should be easy, and we should consider supporting a few common standards and workflows.

Examples of importing metadata:

1. Statistical packages that store metadata in their file formats (e.g., SAS, Stata). `haven` imports a limited amount of this metadata: variable and value labels.
2. CSV files and Excel spreadsheets. This is probably the most common way metadata is stored, although usually without a well-established format.
3. Established metadata standards: Dublin Core, DDI, etc.

Examples that, unfortunately, do not work well with the R ecosystem:

The R ecosystem supports dataset metadata through roxygen documentation, which can then be used with CRAN, `help()`, and pkgdown. This is a great start, but there is no established practice for structuring the description file, so transferring variable and value labels is usually an arduous cut-and-paste task. Supporting Rd file parsing may be challenging, but we could generate data description files more robustly and in a more standard way.
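
A minimal sketch of generating such a roxygen data-description block from a variables sheet; the sheet's column names here are illustrative:

```{r}
# Generate a roxygen data-description block from a variables sheet.
# The `variable`/`label` column names are illustrative assumptions.
variables <- data.frame(
  variable = c("age", "sex"),
  label    = c("Age in years", "Sex at birth")
)

roxygen_block <- function(name, title, variables) {
  items <- sprintf("#'   \\item{%s}{%s}", variables$variable, variables$label)
  paste(c(sprintf("#' %s", title),
          "#'",
          sprintf("#' @format A data frame with %d documented variables:",
                  nrow(variables)),
          "#' \\describe{",
          items,
          "#' }",
          sprintf('"%s"', name)),
        collapse = "\n")
}

cat(roxygen_block("health_survey", "Health Survey 2024", variables))
```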

Doug, I'm not sure what you meant by the import part. Did you mean:

1. Import the metadata from an external data source into the variables and variable details sheets. For example, a user would be able to update the label column with metadata from a DDI file.
2. Import and tag the harmonized dataset with metadata from an external file. This doesn't make much sense to me, since all the variables in the harmonized dataset are created by us, so there should not be any metadata out there for them.

For exporting, I assume you meant the library should be able to export the metadata from the variables and variable details sheets into another format like DDI.

### Easy to use

"Metadata should be easy to use. You should, for example, be able to make data dictionaries at any point in your project"

Did you envision that the library would have functions to create a data dictionary, or would it have helper functions for it?

I think the first option makes sense.
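
For reference, a minimal data-dictionary helper could simply join the two sheets; the column names below are illustrative, not recodeflow's actual schema:

```{r}
# A minimal data-dictionary helper: join the variables sheet (one row
# per variable) to variable_details (one row per category).
# Column names are illustrative assumptions.
variables <- data.frame(
  variable = c("age_cat", "sex"),
  label    = c("Age category", "Sex at birth")
)
variable_details <- data.frame(
  variable       = c("age_cat", "age_cat", "sex", "sex"),
  category       = c("1", "2", "1", "2"),
  category_label = c("child", "adult", "male", "female")
)

make_dictionary <- function(variables, variable_details) {
  merge(variables, variable_details, by = "variable", all.x = TRUE)
}

make_dictionary(variables, variable_details)
```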

### No duplicates

"You should be able to, “type {metadata} once and use many times” including maintaining metadata in machine-actionable form in other projects (journal figures, predictive algorithm deployment, etc.)"

I assume you mean helper functions here, like getting the label for a variable, getting the labels for a category, etc. Perhaps an internal data format for what is in the variables and variable details sheets that is easier to navigate using code?

### Variables transformation information

"Metadata includes data transformation information (i.e. the variable is a spline, dummy variable, interaction, etc.)"

I see this as belonging in another library for developing predictive algorithms. That library can build on what is in the variables and variable details sheets to accomplish this purpose.

### Roles

"Building on the above, identify potential ‘roles’ of variables in the project. Roles are similar to the same as for tidymodels. An example role are variables used in table 1, or variables used as explanatory variables in a model."

See the point I made above.

## Out of scope

### Multiple table datasets