Skip to content

feat: added fare_leg_join_rules.txt schema and validation #1956

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 4, 2025
Merged

Conversation

cka-y
Copy link
Contributor

@cka-y cka-y commented Jan 30, 2025

Summary:

  • New Schema: Introduces the fare_leg_join_rules.txt schema.
  • Foreign Key Validation: Adds a new notice to verify that every network_id found in fare_leg_join_rules.txt matches an entry in networks.txt or routes.txt.
  • Required Fields: Adds a notice for missing required fields when from_stop_id or to_stop_id is present but the other is omitted, ensuring both are supplied consistently.

Expected behavior:
image

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification(https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@cka-y cka-y linked an issue Jan 30, 2025 that may be closed by this pull request
@emmambd
Copy link
Contributor

emmambd commented Jan 30, 2025

This is close to covering #1932 as well if from_network_id and to_network_id checks were added - not sure if it adding it to this PR is easier or if it's still worth splitting up as originally recommended

@cka-y
Copy link
Contributor Author

cka-y commented Jan 30, 2025

@emmambd
The checks that are currently implemented for to_network_id and from_network_id are

  • both are required
  • both represent a FK either from routes.txt or networks.txt

is there a check that needs to be modified/added to cover #1932?
I followed the descriptions here for the rules: https://gtfs.org/documentation/schedule/reference/#fare_leg_join_rulestxt

@cka-y cka-y linked an issue Jan 30, 2025 that may be closed by this pull request
Copy link
Contributor

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 207ef2e
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

0 out of 1801 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 3.64 3.76 ⬆️+0.11
Median -- 1.33 1.41 ⬆️+0.08
Standard Deviation -- 10.42 10.63 ⬆️+0.21
Minimum in References Reports us-california-flex-v2-developer-test-feed-2-gtfs-1818 0.46 0.60 ⬆️+0.13
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 281.78 293.24 ⬆️+11.47
Minimum in Latest Reports es-madrid-consorcio-regional-de-transportes-de-madrid-gtfs-792 0.50 0.52 ⬆️+0.02
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 281.78 293.24 ⬆️+11.47
📜 Memory Consumption
Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 472.69 MiB 463.94 MiB ⬇️-8.75 MiB
Median -- 333.92 MiB 334.85 MiB ⬆️+946.93 KiB
Standard Deviation -- 796.65 MiB 751.59 MiB ⬇️-45.07 MiB
Minimum in References Reports ro-vrancea-consiliul-judetean-vrancea-gtfs-1984 40.02 MiB 40.05 MiB ⬆️+24.07 KiB
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.73 GiB 10.90 GiB ⬆️+178.30 MiB
Minimum in Latest Reports us-california-thousand-oaks-transit-gtfs-33 43.16 MiB 39.20 MiB ⬇️-3.97 MiB
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.73 GiB 10.90 GiB ⬆️+178.30 MiB

* </ul>
*/
@GtfsValidator
public class FareLegJoinRuleValidator extends FileValidator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's include shouldCallValidate(logic based on the missing file method). This does not improve performance, but it's good practice. We can get a log on why some validators don't run without digging too deeply into the code.

Copy link
Contributor

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit a90fef1
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

0 out of 1801 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 3.64 3.70 ⬆️+0.06
Median -- 1.35 1.39 ⬆️+0.04
Standard Deviation -- 10.39 10.45 ⬆️+0.06
Minimum in References Reports us-michigan-detroit-people-mover-gtfs-417 0.47 0.48 ⬆️+0.01
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 279.42 281.07 ⬆️+1.64
Minimum in Latest Reports us-michigan-detroit-people-mover-gtfs-417 0.47 0.48 ⬆️+0.01
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 279.42 281.07 ⬆️+1.64
📜 Memory Consumption
Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 466.44 MiB 463.84 MiB ⬇️-2.60 MiB
Median -- 332.81 MiB 332.99 MiB ⬆️+175.50 KiB
Standard Deviation -- 755.87 MiB 762.78 MiB ⬆️+6.91 MiB
Minimum in References Reports ro-vrancea-consiliul-judetean-vrancea-gtfs-1984 38.19 MiB 415.92 MiB ⬆️+377.73 MiB
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.62 GiB 10.99 GiB ⬆️+380.65 MiB
Minimum in Latest Reports us-colorado-san-miguel-county-gtfs-2195 399.92 MiB 39.10 MiB ⬇️-360.82 MiB
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.62 GiB 10.99 GiB ⬆️+380.65 MiB

@emmambd emmambd requested a review from skalexch February 3, 2025 20:24
Copy link
Contributor

github-actions bot commented Feb 3, 2025

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit e6bbc44
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

0 out of 1801 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 3.68 3.82 ⬆️+0.13
Median -- 1.34 1.43 ⬆️+0.09
Standard Deviation -- 10.68 10.82 ⬆️+0.14
Minimum in References Reports us-california-catalina-express-gtfs-299 0.48 0.60 ⬆️+0.12
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 287.31 295.82 ⬆️+8.52
Minimum in Latest Reports us-california-city-of-wasco-gtfs-1788 0.56 0.47 ⬇️-0.09
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 287.31 295.82 ⬆️+8.52
📜 Memory Consumption
Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 470.80 MiB 472.20 MiB ⬆️+1.40 MiB
Median -- 331.92 MiB 331.92 MiB ⬇️0 bytes
Standard Deviation -- 797.53 MiB 775.80 MiB ⬇️-21.73 MiB
Minimum in References Reports us-colorado-boulder-county-gtfs-2191 39.14 MiB 407.92 MiB ⬆️+368.79 MiB
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.88 GiB 10.64 GiB ⬇️-250.40 MiB
Minimum in Latest Reports ro-vrancea-consiliul-judetean-vrancea-gtfs-1984 40.05 MiB 38.35 MiB ⬇️-1.70 MiB
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.88 GiB 10.64 GiB ⬇️-250.40 MiB

@skalexch
Copy link
Contributor

skalexch commented Feb 4, 2025

@cka-y everything looks good, just one detail I'd like to double check with you and @emmambd. For from_network_id and to_network_id, we do check for foreign_key_violation but we do not check for missing_required_field.
There is a code that specifies if the fields are present but that seems to only be one of conditions that check for foreign_key_violation.

We might want to show missing_required_field before foreign_key_violation even when an empty field is probably also a foreign_key_violation

@cka-y
Copy link
Contributor Author

cka-y commented Feb 4, 2025

@skalexch if i'm understanding properly, you're suggesting to change the display order of the notices so that missing_required_field is before foreign_key_violation?

@skalexch
Copy link
Contributor

skalexch commented Feb 4, 2025

@cka-y I'm suggesting the addition of another missing_required_field notice that checks that from_network_id and to_network_id are not empty. Then, if they are not empty, we can proceed to the foreign_key_violation check.

@cka-y
Copy link
Contributor Author

cka-y commented Feb 4, 2025

Currently the logic is:

  • if from_network_id is present -> validate the FK of the id in from_network_id
  • if to_network_id is present -> validate the FK of the id in to_network_id
  • validate the simultaneous presence of to_stop_ip and from_stop_id [Here is where the missing_required_field notice is generated if one is present and the other is omitted]

I think the current logic does cover your suggestion but I could be mistaking.

@skalexch
Copy link
Contributor

skalexch commented Feb 4, 2025

@cka-y the missing_required_field in this PR validates the simultaneous presence of from_stop_id and to_stop_id. The stop ids are conditionally required where if one exists the other has to exist too (which is taken care of in this PR).

However, from_network_id and to_network_id are required regardless of each other. So even before proceeding to validating the FK, from_network_id and to_network_id both need to exist.

Basically, this should trigger a missing_required_field notice:

from_network_id to_network_id

@cka-y
Copy link
Contributor Author

cka-y commented Feb 4, 2025

@skalexch Yes sorry about that this is actually covered as both are annotated with @Required in the schema here. I updated the logic as the check was for to_stop_id and from_stop_id.

@skalexch
Copy link
Contributor

skalexch commented Feb 4, 2025

Looks good!

Copy link
Member

@davidgamez davidgamez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

github-actions bot commented Feb 4, 2025

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 4700a7a
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

0 out of 1801 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 3.65 3.75 ⬆️+0.11
Median -- 1.34 1.42 ⬆️+0.08
Standard Deviation -- 10.36 10.47 ⬆️+0.10
Minimum in References Reports us-michigan-detroit-people-mover-gtfs-417 0.47 0.53 ⬆️+0.06
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 280.40 283.32 ⬆️+2.92
Minimum in Latest Reports us-florida-citrus-county-transit-gtfs-630 0.56 0.51 ⬇️-0.06
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 280.40 283.32 ⬆️+2.92
📜 Memory Consumption
Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 465.96 MiB 471.54 MiB ⬆️+5.58 MiB
Median -- 333.92 MiB 331.92 MiB ⬇️-2.00 MiB
Standard Deviation -- 757.94 MiB 784.53 MiB ⬆️+26.59 MiB
Minimum in References Reports ro-vrancea-consiliul-judetean-vrancea-gtfs-1984 38.35 MiB 40.05 MiB ⬆️+1.70 MiB
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.96 GiB 10.63 GiB ⬇️-343.41 MiB
Minimum in Latest Reports us-massachusetts-marthas-vineyard-transit-authority-gtfs-420 39.88 MiB 39.73 MiB ⬇️-159.71 KiB
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.96 GiB 10.63 GiB ⬇️-343.41 MiB

@cka-y cka-y merged commit 5ea40f1 into master Feb 4, 2025
137 checks passed
@cka-y cka-y deleted the feat/1950 branch February 4, 2025 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

foreign key changes for Fares changes missing_required_field additions where network.network_id exists
4 participants