Conversation

@e-lo (Contributor) commented Oct 7, 2022

This pull request:

  • adds example data
  • updates validation scripts
  • adds documentation for adding data and validating it

Use this instead because it is more configurable/extensible than the redirects.
1. Add example data (imperfect for now) generated by script `create_example` in data/example/data
2. Update validate-data GH action to refer to correct repos.
@e-lo self-assigned this Oct 7, 2022
@e-lo (Contributor, Author) commented Oct 7, 2022

Issues that are still not working:

  1. References in datapackage.json are to githubusercontent.com, not local...not sure how to fix?
  2. Not sure how to validate a single file using CLI to the schema
  3. GH Action?
  4. tides.spec.json should probably be a profile?

Would love thoughts from @botanize and @j-meelah

@botanize (Contributor) commented Oct 7, 2022

High-level, I'm not sure about the utility of generating random data to validate. It's a bit circular, and I'm not sure there's any real benefit. For example, could the random data generation script help anyone translate their existing CAD/AVL data into TIDES?

  1. References in datapackage.json are to githubusercontent.com, not local...not sure how to fix?

It doesn't seem like these references are a problem? Though the `profile` on your datapackage.json should be `tabular-data-package`.
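A quick way to check that field (a sketch; assumes `jq` is installed and uses the package path from this thread):

```sh
# Inspect the profile declared in the datapackage (path from this PR's layout)
jq '.profile' data/example/data/datapackage.json
# expected output: "tabular-data-package"
```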

  2. Not sure how to validate a single file using CLI to the schema

```sh
frictionless validate data/example/data/vehicle_locations --schema spec/vehicle_locations.schema.json
```

  3. GH Action?

For validation? Why not just `frictionless validate data/example/data/datapackage.json`? It can even output JSON or YAML (`--json`, `--yaml`).
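For example (a sketch using the path from this thread; writing the report to a file is just illustrative):

```sh
# Validate the whole package and capture a machine-readable report
frictionless validate data/example/data/datapackage.json --json > validation-report.json
```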

  4. tides.spec.json should probably be a profile?

I'm not sure anything in the spec matches what we provided in `tides.spec.json`. Everything in the spec that describes a collection requires data resources. Maybe we could make a profile that describes a collection of `tableSchema`? Seems like a lot of work, and the only work `tides.spec.json` is doing is listing the remaining files in `spec` (description is in the table schema, path is nonsense, required is always false). I recommend removing `tides.spec.json`.

I started producing example data from Metro Transit in the jpr_sample_data branch. It's kind of a lot of work to produce! Do you want to look at that and see how we can merge our work?

@e-lo (Contributor, Author) commented Oct 7, 2022

High-level, I'm not sure about the utility of generating random data to validate. It's a bit circular, and I'm not sure there's any real benefit.

I agree - I think the real benefit is generating the blank CSVs to "fill out" if you want to. I will remove the random stuff :-)

@e-lo (Contributor, Author) commented Oct 7, 2022

It doesn't seem like these references are a problem?

I was trying to figure out how to make a local reference, which doesn't seem to be as easy in the current structure because navigating folders without code is...

My goal was to be able to validate the example data against the spec/schemas which were in that branch so that we could evaluate any changes to data that any spec changes brought about.

@e-lo (Contributor, Author) commented Oct 7, 2022

I recommend removing tides.spec.json.

I kind of agree. For now, I'll make an issue (#76) for us to ponder it.

e-lo added 3 commits October 7, 2022 10:42
- Linted
- Auto-generated tabular-data-package datapackage.json from script
- Removed random data and the scripts to generate it, kept CSV templates
- Added READMEs and documentation about each example in directory
@e-lo requested review from botanize and j-meelah October 7, 2022 18:02
@botanize (Contributor) left a comment

This seems like a ton of infrastructure to add a few empty files.

How valuable is it to have a sample that validates no matter what the spec is when the sample is not representative of real data?

I don't love calling the base directory for the samples "data", much prefer "samples".

`scripts/create_example.py` has some template code which can help with the following:

- `write_schema_examples()` will generate blank CSVs according to the TIDES schema
- `write_datapackage()` will generate a datapackage.json based on the TIDES schemas and a set of defaults specified in the script.
Contributor:

I think I'd prefer a datapackage.json template instead of a function. Honestly, I'm still trying to figure out what value an empty generated example provides that wouldn't be better served by a real sample.

Contributor Author (@e-lo) commented Oct 7, 2022:

I don't think the functions serve much purpose other than for developers to be able to auto-generate the example datapackage.json and CSV templates if/when the spec changes.
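(For context, a hypothetical developer workflow; the script path comes from this PR, but the no-argument invocation is an assumption:)

```sh
# Regenerate the template CSVs and datapackage.json after a spec change
# (whether the script takes flags is an assumption; shown here with none)
python scripts/create_example.py
```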

Contributor:

But what value is a CSV template? It's just a header row, which is trivial if you're formatting data from another source anyway. I guess you could tell python/R to append to the empty CSV instead of writing a header, but that saves essentially no effort, or may even create a tiny amount of work (e.g., `fwrite(vehicle_locations, 'vehicle_locations.csv', append = TRUE)` vs `fwrite(vehicle_locations, 'vehicle_locations.csv')`).

@e-lo (Contributor, Author) commented Oct 7, 2022

This would be cool if it worked, but it doesn't seem to (or I can't get it to):

```yaml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - name: Validate data
        uses: frictionlessdata/repository@v2
        with:
          packages: "data/*/data/datapackage.json"
```

Fails with: [screenshot]

- Rendered samples as a table without READMEs
- Renamed data folders to samples and TIDES
- Added metadata to datapackage.json about vendors, GitHub handles, NTD ID, agency name, and exposed it in documentation
- Updated documentation for consistency
- Renamed 'example' to 'template' to be more consistent with what it is
@e-lo marked this pull request as ready for review October 7, 2022 23:44
@e-lo requested a review from botanize October 7, 2022 23:45
@botanize (Contributor) commented Oct 8, 2022

Spaces in URLs! Here's a link that works: https://repository.frictionlessdata.io/pages/dashboard.html?user=TIDES-transit&repo=TIDES&flow=validate+example+tides+data&run=3206719404

But what's really annoying about that link is that it's just a way for them to monitor usage? The artifact is in our repo, so just look there, e.g., https://github.com/TIDES-transit/TIDES/suites/8673442632/artifacts/390626873

@botanize (Contributor) left a comment

I'm still struggling to see what purpose CSVs with only header rows can serve. With that in mind, I don't really think a "template" directory is very useful at all. I think the simplest and most effective thing to do is to embed an example dataresource.json file into /samples/README.md and remove all of /samples/template.

Comment on lines 53 to 62
| `sources` | Array of data sources formatted as a [`source`](#data-source) | Recommended |

### Data Source

| **Field** | **Description** | **Required** |
| --------- | --------------- | ------------ |
| `title` | Description of the data source. | Required |
| `component` | What technology component was used to generate this data (directly or indirectly)? Examples include `AVL`, `APC`, `AFC`, etc. | Recommended |
| `product` | What product was used to generate this data (directly or indirectly)? | Recommended |
| `vendor` | What company makes this product? | Recommended |
Contributor:

I think the primary use for this is for people to see at a glance whether the example they're looking at is comparable to their system; this level of precision, with vendor, product, and component per resource, is overkill and leads to a lot of duplication in the dataresource.json file.

For example, in our system `vehicle_locations`, `stop_visits`, `passenger_events`, `station_activities`, and `trips_performed` would all be the same vendor/product (Vontas/TransitMaster), and the component would be an array (`["AVL", "APC"]`). Furthermore, AVL, APC, AFC are all implied in each table, by definition.

I suggest we move the sources array to the top level of the data package object and simplify the source object to `source` and `components`. For example,

```json
{
  "title": "Metro Transit Basic",
  "name": "metrotransitmn-basic",
  "agency": "Metro Transit",
  "ntd_id": 50027,
  "sources": [
    { "source": "TransitMaster", "components": ["APC", "AVL"]},
    { "source": "emtrac", "components": "AVL"},
    { "source": "Cubic", "components": "AFC"}
  ],
  "resources": [
    {
      "path": "TIDES/vehicle_locations.csv",
      "profile": "tabular-data-resource",
      "schema": "https://raw.githubusercontent.com/TIDES-transit/TIDES/main/spec/vehicle_locations.schema.json"
    }
  ]
}
```

Contributor Author (@e-lo):

The main value I see in putting it under each resource is to explicitly say where each piece of the "TIDES pie" is coming from, so that agencies trying to mimic another agency's data flow diagram can do so. It's not necessarily obvious which resource comes from which product. For example, there are many agencies that receive APC data from their "APC product" that is then manipulated by another product (including internal scripts) and then spat out to other systems. Exposing this seems like it would add a great deal of value. That said, there is no reason why you can't just put all the sources up top... especially if we aren't articulating a profile which would require it and are merely suggesting.

Contributor:

Sure. It could be nested: big picture at the top level, details for each resource, both optional.

Comment on lines 38 to 41
### Template

A `datapackage.json` template is available at [`/data/template/TIDES/datapackage.json`](https://raw.githubusercontent.com/TIDES-transit/TIDES/main/data/template/TIDES/datapackage.json).

Contributor:

Change to

    ### Example datapackage.json
    ```
    {
      "title": "Example Title",
      "name": "example-name",
      "description": "a description goes here, can include markdown formatting",
      "agency": "Metro",
      "ntd_id": 00000,
      "profile": "tabular-data-package",
      "licenses": [{"name": "Apache-2.0"}],
      "contributors": [
        {"title": "Name Name", "github": "my_handle", "email": null}
      ],
      "resources": [
        {
          "name": "vehicle_locations",
          "path": "TIDES/vehicle_locations.csv",
          "schema": "https://raw.githubusercontent.com/TIDES-transit/TIDES/main/spec/vehicle_locations.schema.json"
        }
      ]
    }
    ```

Contributor Author (@e-lo):

I don't see why this is better when we can include the file in the markdown? I'd prefer to manage data as data, not as data within markdown.

Contributor:

But it's not data! It's documentation: an example that doesn't even have meaningful values. Documentation for just about every software tool is littered with example code and configuration; this is exactly the same.

If you want to include a file, go ahead, but I do think that at the very least the file should have an obviously different name, e.g., datapackage.json.template or datapackage.sample or something like that so that it's not accidentally interpreted as a real data package description.
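(A sketch of that convention; the file and directory names here are assumptions:)

```sh
# Store the template under a name tools won't mistake for a real package
# descriptor, and copy it into place when starting an actual dataset.
cp samples/template/TIDES/datapackage.json.template my-agency/TIDES/datapackage.json
```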

e-lo added 4 commits October 10, 2022 10:00
- changed workflow title to fix link with spaces
- moved datapackage.json up one directory, updated docs, updated source refs
- added missing --schema flag
@e-lo (Contributor, Author) commented Oct 10, 2022

All, there are two things @botanize and I are in (what I would hope we both categorize as) healthy disagreement about, and it might be useful for the broader community to weigh in on them:

  1. Sources: should we document them at the resource- or datapackage-level? (see discussions above)
  2. Templates: should we include datapackage.json and csv templates that are auto-generated from the spec? Or just include documentation about datapackage.json in README.md? Or some middle ground? (see discussions above)

cc: @jlstpaul @hunterowens ...others?

@e-lo requested a review from botanize September 20, 2023 18:01
@e-lo (Contributor, Author) commented Sep 20, 2023

Updated this PR to latest main; it's passing, and I think it's ready to merge. @botanize?

@botanize (Contributor) left a comment

Hi E,

I think I've pointed out all the errors and inconsistencies, but you may want to double-check the syntax and usage of the scripts in `bin`. I also think we need more documentation of those scripts.

I'm also worried that the contents of samples/template/TIDES don't validate.

README.md Outdated
Specific files can be validated by running the frictionless framework against them and their corresponding schemas as follows:

```sh
frictionless validate vehicles.csv --schema https://raw.githubusercontent.com/TIDES-transit/TIDES/main/spec/schema.vehicles.json
```
Contributor:

This requires `--schema-sync`, or optional fields that are left out will cause validation errors.
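For example (the command from the README above with the flag added; flag placement is an assumption):

```sh
frictionless validate vehicles.csv \
  --schema https://raw.githubusercontent.com/TIDES-transit/TIDES/main/spec/schema.vehicles.json \
  --schema-sync
```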

README.md Outdated

### Template

Templates of `datapackage.json` and each TIDES file type are located in the `/samples/template` directory.
Contributor:

Do you want to mention how these should be used?

```sh
# Script: validate_frictionless_package.sh
# Description: Bash script to validate a Frictionless Data Package using the Frictionless CLI.
# Usage: validate_frictionless_package.sh [-v tides_version | -l local_schema_location] [-d dataset_location]
# -v tides_version: Optional. Specify the version of the TIDES specification or 'local' to
```
Contributor:

Is this parameter necessary? Shouldn't this always use the schema specified in the datapackage, or treat -l as an override? In other words: if -l is provided, override the datapackage; otherwise use the datapackage. No need for -v.

Contributor:

In fact, it seems like -l should be -s, which lets you set any schema location, overriding any datapackage. It wouldn't need to be local; it could also be a GitHub location, NFS, samba, &c.
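(A minimal sketch of the suggested interface; option names, defaults, and validation strategy are all assumptions, not the actual script:)

```sh
#!/usr/bin/env bash
# -s overrides whatever schema the datapackage references; otherwise the
# package is validated as-is using its own schema references.
schema=""
dataset="."
while getopts "s:d:" opt; do
  case "$opt" in
    s) schema="$OPTARG" ;;  # any URL or path: local, GitHub raw, NFS, samba, &c.
    d) dataset="$OPTARG" ;;
  esac
done
if [ -n "$schema" ]; then
  # override: validate each tabular file directly against the given schema
  for f in "$dataset"/TIDES/*.csv; do
    frictionless validate "$f" --schema "$schema" --schema-sync
  done
else
  frictionless validate "$dataset/datapackage.json"
fi
```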

Contributor:

Also, samples/template/TIDES doesn't validate. Should it?

Contributor:

  • this file is not marked executable
  • how does it differ from bin/validate-datapackage, should it replace it?

Several other validation scripts and tools with more flexibility, such as validating against the canonical named version or a local spec, can be found in the `/bin` directory, with usage available via the `--help` flag.

```bash
bin/validate-datapackage [-v remote_spec_ref | -l local_spec_path] [-d dataset_path]
```
Contributor:

There are 5 scripts in bin/, and the usage information isn't always clear about why you'd use one script or another; in addition, the names are all very similar. I think it would help a lot to document the purpose of each script here.

```sh
#!/usr/bin/env bash

# Script to validate a local JSON file against a schema specified in a GitHub repository.
# Usage: validate-data-package-json.sh [-r ref | -l local_schema_location] [-f datapackage_file]
```
Contributor:

this script does not have a .sh suffix, despite what the usage says

Contributor:

Attempting to run this script on the sample datapackage.json results in a validation error:

```sh
$ ./bin/validate-data-package-json -l spec -f samples/template/TIDES/datapackage.json
Validating data package file in 
JSON/YAML string must be a a map, cannot load "spec/tides-data-package.json"
```

Am I doing something wrong, or is there a problem somewhere else?

Contributor:

It also fails without specifying -l

```sh
$ ./bin/validate-data-package-json -f samples/template/TIDES/datapackage.json
Validating data package file in 
Schema file not found on GitHub for the specified TIDES version: 
```

Contributor:

This script and validate-data-package-json use -f, but validate-data-package uses -d; they should probably use the same parameters for the same inputs.

Comment on lines 7 to +13

```sh
Usage: validate-datapackage-to-profile [-r remote_spec_ref | -l local_spec_path] [-f datapackage_file]
-r remote_spec_ref: Optional. Specify the ref name of the GitHub repository for validating against
   a remote profile where the profile is in the sub-path /spec/tides-data-package.json.
   Should not be used with -l option. Example: -r main
-l local_spec_path: Optional. Specify the location of the local tides-data-package-json to use.
   Should not be used with -r option. Example: -l spec
-d dataset_path: Optional. Specify the path of the datapackage.json file.
```
Contributor:

Usage line says -f, but help and parsing use -d

@e-lo requested a review from a team as a code owner December 4, 2023 20:48
@botanize (Contributor) commented:
@e-lo, can you resolve or address the comments I left in my last review?

@e-lo (Contributor, Author) commented Dec 11, 2023

Yup! It's on my list.
