Conversation

@e-lo (Contributor) commented Oct 7, 2022

This pull request:

  • adds example data
  • updates validation scripts
  • adds documentation for adding data and validating it

Use this instead because it is more configurable/extensible than the redirects.
1. Add example data (imperfect for now) generated by script `create_example` in data/example/data
2. Update validate-data GH action to refer to correct repos.
@e-lo self-assigned this Oct 7, 2022
@e-lo (Contributor, Author) commented Oct 7, 2022

Issues that are still not working:

  1. References in datapackage.json are to githubusercontent.com, not local...not sure how to fix?
  2. Not sure how to validate a single file using CLI to the schema
  3. GH Action?
  4. tides.spec.json should probably be a profile?

Would love thoughts from @botanize and @j-meelah

@botanize (Contributor) commented Oct 7, 2022

High-level, I'm not sure about the utility of generating random data to validate. It's a bit circular, and I'm not sure there's any real benefit. For example, could the random data generation script help anyone translate their existing CAD/AVL data into TIDES?

  1. References in datapackage.json are to githubusercontent.com, not local...not sure how to fix?

It doesn't seem like these references are a problem? Though the `profile` on your datapackage.json should be `tabular-data-package`.
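A quick way to check that field (a sketch; assumes `jq` is installed and uses the package path from this thread):

```sh
# Inspect the profile declared in the datapackage (path from this PR's layout)
jq '.profile' data/example/data/datapackage.json
# expected output: "tabular-data-package"
```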

  2. Not sure how to validate a single file using CLI to the schema

```sh
frictionless validate data/example/data/vehicle_locations --schema spec/vehicle_locations.schema.json
```

  3. GH Action?

For validation? Why not just `frictionless validate data/example/data/datapackage.json`? It can even output JSON or YAML (`--json`, `--yaml`).
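For example (a sketch using the path from this thread; writing the report to a file is just illustrative):

```sh
# Validate the whole package and capture a machine-readable report
frictionless validate data/example/data/datapackage.json --json > validation-report.json
```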

  4. tides.spec.json should probably be a profile?

I'm not sure anything in the spec matches what we provided in `tides.spec.json`. Everything in the spec that describes a collection requires data resources. Maybe we could make a profile that describes a collection of `tableSchema`? Seems like a lot of work, and the only work `tides.spec.json` is doing is listing the remaining files in `spec` (description is in the table schema, path is nonsense, required is always false). I recommend removing `tides.spec.json`.

I started producing example data from Metro Transit in the jpr_sample_data branch. It's kind of a lot of work to produce! Do you want to look at that and see how we can merge our work?

@e-lo (Contributor, Author) commented Oct 7, 2022

High-level, I'm not sure about the utility of generating random data to validate. It's a bit circular, and I'm not sure there's any real benefit.

I agree - I think the real benefit is generating the blank CSVs to "fill out" if you want to. I will remove the random stuff :-)

@e-lo (Contributor, Author) commented Oct 7, 2022

It doesn't seem like these references are a problem?

I was trying to figure out how to make a local reference, which doesn't seem to be as easy in the current structure because navigating folders without code is...

My goal was to be able to validate the example data against the spec/schemas which were in that branch so that we could evaluate any changes to data that any spec changes brought about.

@e-lo (Contributor, Author) commented Oct 7, 2022

I recommend removing tides.spec.json.

I kind of agree. For now, I'll make an issue (#76) for us to ponder it.

e-lo added 3 commits October 7, 2022 10:42
- Linted
- Auto-generated tabular-data-package datapackage.json from script
- Removed random data and the scripts to generate it, kept CSV templates
- Added READMEs and documentation about each example in directory
@e-lo requested review from botanize and j-meelah October 7, 2022 18:02
@botanize (Contributor) left a comment

This seems like a ton of infrastructure to add a few empty files.

How valuable is it to have a sample that validates no matter what the spec is when the sample is not representative of real data?

I don't love calling the base directory for the samples "data", much prefer "samples".

`scripts/create_example.py` has some template code which can help with the following:

- `write_schema_examples()` will generate blank CSVs according to the TIDES schema
- `write_datapackage()` will generate a datapackage.json based on the TIDES schemas and a set of defaults specified in the script.
Contributor:

I think I'd prefer a datapackage.json template instead of a function. Honestly, I'm still trying to figure out what value an empty generated example provides that wouldn't be better served by a real sample.

Contributor Author (@e-lo) commented Oct 7, 2022:

I don't think the functions serve much purpose other than for developers to be able to auto-generate the example datapackage.json and CSV templates if/when the spec changes.
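(For context, a hypothetical developer workflow; the script path comes from this PR, but the no-argument invocation is an assumption:)

```sh
# Regenerate the template CSVs and datapackage.json after a spec change
# (whether the script takes flags is an assumption; shown here with none)
python scripts/create_example.py
```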

Contributor:

But what value is a CSV template? It's just a header row, which is trivial if you're formatting data from another source anyway. I guess you could tell python/R to append to the empty CSV instead of writing a header, but that saves essentially no effort, or may even create a tiny amount of work (e.g., `fwrite(vehicle_locations, 'vehicle_locations.csv', append = TRUE)` vs `fwrite(vehicle_locations, 'vehicle_locations.csv')`).

@e-lo (Contributor, Author) commented Oct 7, 2022

This would be cool if it worked, but it doesn't seem to (or I can't get it to):

```yaml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - name: Validate data
        uses: frictionlessdata/repository@v2
        with:
          packages: "data/*/data/datapackage.json"
```

Fails with: [screenshot]

- Rendered samples as a table without READMEs
- Renamed data folders to samples and TIDES
- Added metadata to datapackage.json about vendors, GitHub handles, NTD ID, agency name, and exposed it in documentation
- Updated documentation for consistency
- Renamed 'example' to 'template' to be more consistent with what it is
@e-lo marked this pull request as ready for review October 7, 2022 23:44
@e-lo requested a review from botanize October 7, 2022 23:45
@botanize (Contributor) commented Oct 8, 2022

Spaces in URLs! Here's a link that works: https://repository.frictionlessdata.io/pages/dashboard.html?user=TIDES-transit&repo=TIDES&flow=validate+example+tides+data&run=3206719404

But what's really annoying about that link is that it's just a way for them to monitor usage? The artifact is in our repo, so just look there, e.g., https://github.com/TIDES-transit/TIDES/suites/8673442632/artifacts/390626873

@botanize (Contributor) left a comment

I'm still struggling to see what purpose CSVs with only header rows can serve. With that in mind, I don't really think a "template" directory is very useful at all. I think the simplest and most effective thing to do is to embed an example dataresource.json file into /samples/README.md and remove all of /samples/template.

Comment on lines 53 to 62
| `sources` | Array of data sources formatted as a [`source`](#data-source) | Recommended |

### Data Source

| **Field** | **Description** | **Required** |
| --------- | --------------- | ------------ |
| `title` | Description of the data source. | Required |
| `component` | What technology component was used to generate this data (directly or indirectly)? Examples include `AVL`, `APC`, `AFC`, etc. | Recommended |
| `product` | What product was used to generate this data (directly or indirectly)? | Recommended |
| `vendor` | What company makes this product? | Recommended |
Contributor:

I think the primary use for this is for people to see at a glance whether the example they're looking at is comparable to their system; this level of precision, with vendor, product, and component per resource, is overkill and leads to a lot of duplication in the dataresource.json file.

For example, in our system `vehicle_locations`, `stop_visits`, `passenger_events`, `station_activities`, and `trips_performed` would all be the same vendor/product (Vontas/TransitMaster), and the component would be an array (`["AVL", "APC"]`). Furthermore, AVL, APC, AFC are all implied in each table, by definition.

I suggest we move the sources array to the top level of the data package object and simplify the source object to `source` and `components`. For example,

```json
{
  "title": "Metro Transit Basic",
  "name": "metrotransitmn-basic",
  "agency": "Metro Transit",
  "ntd_id": 50027,
  "sources": [
    { "source": "TransitMaster", "components": ["APC", "AVL"]},
    { "source": "emtrac", "components": "AVL"},
    { "source": "Cubic", "components": "AFC"}
  ],
  "resources": [
    {
      "path": "TIDES/vehicle_locations.csv",
      "profile": "tabular-data-resource",
      "schema": "https://raw.githubusercontent.com/TIDES-transit/TIDES/main/spec/vehicle_locations.schema.json"
    }
  ]
}
```

Contributor Author (@e-lo):

The main value I see in putting it under each resource is to explicitly say where each piece of the "TIDES pie" is coming from, so that agencies trying to mimic another agency's data flow diagram can do so. It's not necessarily obvious which resource comes from which product. For example, there are many agencies that receive APC data from their "APC product" that is then manipulated by another product (including internal scripts) and then spat out to other systems. Exposing this seems like it would add a great deal of value. That said, there is no reason why you can't just put all the sources up top... especially if we aren't articulating a profile which would require it and are merely suggesting.

Contributor:

Sure. It could be nested: big picture at the top level, details for each resource, both optional.

Comment on lines 38 to 41
### Template

A `datapackage.json` template is available at [`/data/template/TIDES/datapackage.json`](https://raw.githubusercontent.com/TIDES-transit/TIDES/main/data/template/TIDES/datapackage.json).

Contributor:

Change to

    ### Example datapackage.json
    ```
    {
      "title": "Example Title",
      "name": "example-name",
      "description": "a description goes here, can include markdown formatting",
      "agency": "Metro",
      "ntd_id": 00000,
      "profile": "tabular-data-package",
      "licenses": [{"name": "Apache-2.0"}],
      "contributors": [
        {"title": "Name Name", "github": "my_handle", "email": null}
      ],
      "resources": [
        {
          "name": "vehicle_locations",
          "path": "TIDES/vehicle_locations.csv",
          "schema": "https://raw.githubusercontent.com/TIDES-transit/TIDES/main/spec/vehicle_locations.schema.json"
        }
      ]
    }
    ```

Contributor Author (@e-lo):

I don't see why this is better when we can include the file in the markdown? I'd prefer to manage data as data, not as data within markdown.

Contributor:

But it's not data! It's documentation: an example that doesn't even have meaningful values. Documentation for just about every software tool is littered with example code and configuration; this is exactly the same.

If you want to include a file, go ahead, but I do think that at the very least the file should have an obviously different name, e.g., datapackage.json.template or datapackage.sample or something like that so that it's not accidentally interpreted as a real data package description.
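(A sketch of that convention; the file and directory names here are assumptions:)

```sh
# Store the template under a name tools won't mistake for a real package
# descriptor, and copy it into place when starting an actual dataset.
cp samples/template/TIDES/datapackage.json.template my-agency/TIDES/datapackage.json
```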

e-lo added 4 commits October 10, 2022 10:00
- changed workflow title to fix link with spaces
- moved datapackage.json up one directory, updated docs, updated source refs
- added missing --schema flag
@e-lo (Contributor, Author) commented Oct 10, 2022

All, there are two things @botanize and I are in (what I would hope we both categorize as) healthy disagreement about, and it might be useful for the broader community to weigh in on them:

  1. Sources: should we document them at the resource- or datapackage-level? (see discussions above)
  2. Templates: should we include datapackage.json and csv templates that are auto-generated from the spec? Or just include documentation about datapackage.json in README.md? Or some middle ground? (see discussions above)

cc: @jlstpaul @hunterowens ...others?

@e-lo requested a review from botanize September 20, 2023 18:01
@e-lo (Contributor, Author) commented Sep 20, 2023

Updated this PR to latest main; it's passing, and I think it's ready to merge. @botanize?

@botanize (Contributor) left a comment

Hi E,

I think I've pointed out all the errors and inconsistencies, but you may want to double-check the syntax and usage of the scripts in `bin`. I also think we need more documentation of those scripts.

I'm also worried that the contents of samples/template/TIDES don't validate.

README.md Outdated
Specific files can be validated by running the frictionless framework against them and their corresponding schemas as follows:

```sh
frictionless validate vehicles.csv --schema https://raw.githubusercontent.com/TIDES-transit/TIDES/main/spec/schema.vehicles.json
```
Contributor:

This requires `--schema-sync`, or optional fields that are left out will cause validation errors.
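For example (the command from the README above with the flag added; flag placement is an assumption):

```sh
frictionless validate vehicles.csv \
  --schema https://raw.githubusercontent.com/TIDES-transit/TIDES/main/spec/schema.vehicles.json \
  --schema-sync
```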

README.md Outdated

### Template

Templates of `datapackage.json` and each TIDES file type are located in the `/samples/template` directory.
Contributor:

Do you want to mention how these should be used?

```sh
# Script: validate_frictionless_package.sh
# Description: Bash script to validate a Frictionless Data Package using the Frictionless CLI.
# Usage: validate_frictionless_package.sh [-v tides_version | -l local_schema_location] [-d dataset_location]
# -v tides_version: Optional. Specify the version of the TIDES specification or 'local' to
```
Contributor:

Is this parameter necessary? Shouldn't this always use the schema specified in the datapackage, or treat -l as an override? In other words: if -l is provided, override the datapackage; otherwise use the datapackage. No need for -v.

Contributor:

In fact, it seems like -l should be -s, which lets you set any schema location, overriding any datapackage. It wouldn't need to be local; it could also be a GitHub location, NFS, samba, &c.
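(A minimal sketch of the suggested interface; option names, defaults, and validation strategy are all assumptions, not the actual script:)

```sh
#!/usr/bin/env bash
# -s overrides whatever schema the datapackage references; otherwise the
# package is validated as-is using its own schema references.
schema=""
dataset="."
while getopts "s:d:" opt; do
  case "$opt" in
    s) schema="$OPTARG" ;;  # any URL or path: local, GitHub raw, NFS, samba, &c.
    d) dataset="$OPTARG" ;;
  esac
done
if [ -n "$schema" ]; then
  # override: validate each tabular file directly against the given schema
  for f in "$dataset"/TIDES/*.csv; do
    frictionless validate "$f" --schema "$schema" --schema-sync
  done
else
  frictionless validate "$dataset/datapackage.json"
fi
```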

Contributor:

Also, samples/template/TIDES doesn't validate. Should it?

Contributor:

  • this file is not marked executable
  • how does it differ from bin/validate-datapackage, should it replace it?

Several other validation scripts and tools with more flexibility, such as validating against the canonical named version or a local spec, can be found in the `/bin` directory, with usage available via the `--help` flag.

```bash
bin/validate-datapackage [-v remote_spec_ref | -l local_spec_path] [-d dataset_path]
```
Contributor:

There are 5 scripts in bin/, and the usage information isn't always clear about why you'd use one script or another; in addition, the names are all very similar. I think it would help a lot to document the purpose of each script here.

```sh
#!/usr/bin/env bash

# Script to validate a local JSON file against a schema specified in a GitHub repository.
# Usage: validate-data-package-json.sh [-r ref | -l local_schema_location] [-f datapackage_file]
```
Contributor:

this script does not have a .sh suffix, despite what the usage says

Contributor:

Attempting to run this script on the sample datapackage.json results in a validation error:

```sh
$ ./bin/validate-data-package-json -l spec -f samples/template/TIDES/datapackage.json
Validating data package file in 
JSON/YAML string must be a a map, cannot load "spec/tides-data-package.json"
```

Am I doing something wrong, or is there a problem somewhere else?

Contributor:

It also fails without specifying -l

```sh
$ ./bin/validate-data-package-json -f samples/template/TIDES/datapackage.json
Validating data package file in 
Schema file not found on GitHub for the specified TIDES version: 
```

Contributor:

This script and validate-data-package-json use -f, but validate-data-package uses -d; they should probably use the same parameters for the same inputs.

Comment on lines 7 to +13

```sh
Usage: validate-datapackage-to-profile [-r remote_spec_ref | -l local_spec_path] [-f datapackage_file]
-r remote_spec_ref: Optional. Specify the ref name of the GitHub repository for validating against
   a remote profile where the profile is in the sub-path /spec/tides-data-package.json.
   Should not be used with -l option. Example: -r main
-l local_spec_path: Optional. Specify the location of the local tides-data-package-json to use.
   Should not be used with -r option. Example: -l spec
-d dataset_path: Optional. Specify the path of the datapackage.json file.
```
Contributor:

Usage line says -f, but help and parsing use -d

@e-lo requested a review from a team as a code owner December 4, 2023 20:48
@botanize (Contributor) commented:
@e-lo, can you resolve or address the comments I left in my last review?

@e-lo (Contributor, Author) commented Dec 11, 2023

Yup! It's on my list.
