Thanks Gabriel! 1 - What's a signal? What exactly do you mean by the term 'signals'? Like Waze is 'signalling' an event? I'm just trying to understand the word choice for this new table. This makes good sense for the dates and enrichment. There is one change in the schema I'm not sure about:
Removing the jam_id and irregularity_id from the Alert table may not be good, since they make it easier to connect alerts to the other tables. Waze has them in its schema and it seems like we may want to keep them? Maybe leave all 3? I'm also not sure whether the Waze structure will actually remove duplicates in the Signal table. Does the data line up enough around alerts, jams, irregularities (a/j/i)? Do you have to validate every new a/j/i insert against the signal table to see if it's already there? I'd like to loop in @jrstill on this and we can work out the details. @hmdavis may have some feedback too.
Hi, Michael. Thanks for the prompt response! 1 - With respect to "Signal", I refer to the JSON file, stored in S3, that Waze updates every 2 minutes, containing all jams, alerts and irregularities. I couldn't find a better word, maybe you guys could help? But to conclude my rationale: if that were the case, a UNIQUE constraint on the signals.startTimeMillis field would avoid that kind of duplicate. I already saw a few things to change in this PR, but I will wait for more comments from the team. Please let me know what you guys think.
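As a concrete sketch of that idea (table and column names are my own placeholders, not the final schema), such a constraint might look like:

```sql
-- Hypothetical excerpt of the proposed "signals" table: the UNIQUE constraint
-- rejects a second row carrying the same Waze publication timestamp.
CREATE TABLE signals (
  id                SERIAL PRIMARY KEY,
  start_time_millis BIGINT NOT NULL UNIQUE,
  end_time_millis   BIGINT
);
```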
If the Waze data itself references the |
Hi, @jrstill. It seems that those uuids are not unique. Check your Slack - I will send you two Waze JSONs from our polygon, with a 5 minute interval between them. The jam uuid 6143090 is present in both, and with different speeds! Also, it sounds a little risky to me to base an UPSERT operation on a field that is controlled by a third party. If Waze ever makes a mistake or introduces a bug, we may have data erased from our database. I think @schnuerle's point of view of avoiding constraints might hold in this case. Maybe we keep the fields there ("alerts.jam_id" and "alerts.irregularity_id"), but without an FK constraint. Another option is to create a composite UNIQUE constraint on (jam_date, uuid). That is what I have done here. Let me know what you guys think and, if you find it appropriate, I'll push another commit to this PR.
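For illustration, a minimal sketch of the composite constraint described above (column names are assumptions based on this thread, not the final schema):

```sql
-- Waze uuids repeat across files, so uniqueness is only enforced together with the
-- jam's date; repeated sightings within a day can then be treated as the same jam.
ALTER TABLE jams
  ADD CONSTRAINT jams_jam_date_uuid_unique UNIQUE (jam_date, uuid);
```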
@GabrielBogo - I talked with @jrstill today for a while and I know you all talked earlier. Here's what we came up with for you to review. Note AJI is my shorthand for Alerts Jams Irregularities.
This will still enter a row for each AJI every 2 minutes (see comment 1 above). It's useful, though, if we are backloading missing data from JSON files: then it knows not to add a new row for duplicate data files. Matches will still update the row, which is good for backfilling new columns of data Waze may add.
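A rough sketch of that insert-or-update behavior in Postgres, assuming a unique index on json_hash (table, column, and file names here are placeholders, not the agreed schema):

```sql
-- Reprocessing a file that is already loaded hits the unique key and becomes an
-- update instead, which also backfills columns added after the original load.
INSERT INTO data_files (file_name, start_time_millis, json_hash, date_updated)
VALUES ('waze-feed-sample.json', 1516195800000, '<sha1-of-file-contents>', now())
ON CONFLICT (json_hash)
DO UPDATE SET file_name    = EXCLUDED.file_name,
              date_updated = EXCLUDED.date_updated;
```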
I know this is similar to the DataFile table, but I think its purpose is distinct. DataFile is more about finding processing issues; ReportPeriod is more about the time of the AJI report. Later, we can enrich this with week-of-year number, quarter number, and time blocks (e.g. a reference to a future table that defines traffic-engineering blocks of hours for reporting, such as Morning Peak or Weekend Night).
So lots of good DB schema changes here! Once we agree on them, I'm not sure whether I should take your PR and modify it or open a new PR.
Hi, @schnuerle. Your proposals seem plausible and well considered. See my comments below:
Extra: Are you guys in favor of adding the JSON column for the jam's coordinates, as initially proposed in this PR? Cheers,
I think we are almost there. Once we agree, I volunteer to start implementing these changes, but I don't know your project's roadmap, team, and deadlines. Please let me know what you guys think. Cheers,
It sounds like the PK issue (or, more generally, how we recognize a unique AJI record) is still a bit of a sticking point, so I'll throw in my understanding of it. It appears that, even if you request the data feed every 10 seconds, the start and end times at the root level are in 1 minute increments aligned exactly with the minute. So if you request it any time between 1:01:00 and 1:01:59, the start will be 1:01:00, even if you request the data every 10 seconds. However, the actual AJI records themselves may have different data within that time frame.

To handle this, we came up with using the start time from the root level along with a hash of the AJI record itself to identify unique records. This will allow us to reprocess the same file as many times as we want, even if the file name gets changed, without duplicating data. This is a bit of a conglomeration of how AWS does ids (ARNs) and how GitHub itself identifies commits (recursive SHA1 hashes over the whole tree + metadata). In this way, even if somehow the data retriever messed up and queued 5 copies of every file for processing, we wouldn't get duplicate records. The only way a new record goes in is if the actual AJI JSON is different OR the start time at the root level is different. As far as how to ensure that the JSON hashing is idempotent, I can think of a few different ways we might go about that. I'll test some ideas and figure out which work and are fast enough, but overall I think that's definitely a solvable problem, so I wouldn't worry much about it.

One other thing I'll chime in on is storing the whole file JSON in a table in the DB. I'd recommend against that, though if it is what everyone wants I'll certainly do it. The files are already going to be stored forever in significantly cheaper S3 storage, plus putting them in the DB actually increases the size (because they affect indexes, there's metadata around their records, etc.). If having the coordinates in JSON form stored in a field in the table makes certain use cases a lot easier, I'm all for that, but I'd be hesitant to duplicate the whole file in a field.
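To make that concrete, here is one possible way to form such a hash, sketched in SQL with the pgcrypto extension; in practice it would more likely be computed in the processor code before insert, and the sample record, delimiter, and ordering shown are only assumptions:

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Casting through jsonb normalizes key order, so the hash stays stable even if Waze
-- reorders keys in the raw file. Folding in the root-level start time means a new
-- row only appears when the AJI JSON or the start time actually differs.
SELECT encode(
         digest('{"uuid": 6143090, "speedKMH": 12.4}'::jsonb::text || ':' || '1516195260000',
                'sha1'),
         'hex') AS json_hash;
```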
Hi, @jrstill, thank you for the clarifications.
Switched to async/await (new node version support in lambda)
Stubbed individual list (A/J/I) processor functions
Added hashing function (see PR #25 discussion)
Ok great, then I think we agree to store only the JSON coordinates for a specific AJI in a new jsonb field in each AJI table (in addition to the Coordinates table). The full JSON for a specific AJI and the JSON for the entire file do not need to be stored in the DB. Next step is to rework the schema with all of these new changes. I can finish it early next week, or if someone else gets to it first that's great.
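As a sketch of what that per-AJI coordinates field could look like for jams (the column name is a placeholder; the real one was added in a later commit):

```sql
-- Keeps the ordered coordinate list with the jam row for easy retrieval, while the
-- normalized Coordinates table stays available for point-level queries.
ALTER TABLE jams ADD COLUMN jam_line jsonb;
```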
Hi, @schnuerle. I will work on it today and see how far I can get. I'll get back to you on Monday with a status update on my progress. Cheers,
Hi, @schnuerle and @jrstill. I just pushed a new version of the schema. I won't go over all the changes because they can be easily seen in the code diff. A few comments, though:
1 - The "VARCHAR[500]" datatype did not work in my local database, so I changed it to TEXT. Would that be a problem?
2 - I changed the ids from BIGINT to SERIAL to allow for auto-increment.
3 - Irregularities already contain the same date in two formats: milliseconds and string. I chose to keep both as-is and add a new UTC one. Would that be too much?
4 - Is there ever an "irregularity_id" in an Alert? I didn't find it in my DataFiles or in Waze's spec.
5 - For Alerts, why aren't we storing the "Confidence" field?
6 - I haven't tested the "coordinates", "roads" and "alert_types" tables. I'll proceed to those upon code review and agreement on the changes made so far.
Below is the code (Python) that I used to test the schema. It takes a DataFile, explodes the data and stores it in the AJI tables. The JSON ordered hash has worked properly so far. https://github.com/GabrielBogo/Joinville-Smart-Mobility/blob/master/src/data/store_data_file.py
Thanks for working on it!
1- I think that's fine but it depends on the database we use. We were going to go with Aurora Postgres, so that's why it was varchar. I'll let @jrstill give his thoughts.
2- That's good.
3- That makes sense.
4- Maybe not; I'll need to double check where that came from.
5- We should be. I added it.
6- Ok.
7- It looks like you put the ReportPeriod fields we discussed into the data_file table. I'm ok with that.
8- I think we still need a RawRecord table. This is the hashed JSON for each AJI record. It would be used to check for and prevent 100% duplicate records (including timestamp) when we are trying to run scripts to replace missing data with JSON files. @jrstill is going to check and see if this could be added as a field in each AJI table, instead of its own table.
Good. Let's wait for @jrstill's input, then. One question, @schnuerle, regarding the RawRecord table: wouldn't the field "waze.data_files.json_hash" be enough to guarantee uniqueness, since it is the hash of the ordered contents of the entire data file (AJI + timestamp)?
code/sql/schema.sql
"date_created" TIMESTAMP, | ||
"date_updated" TIMESTAMP, | ||
"file_name" TEXT NOT NULL, | ||
"json_hash" uuid NOT NULL, |
The hash we create will be SHA1, so this probably needs to be a varchar(40).
Well, in this case this hash is a subset of the whole file, i.e. each single A, J, or I record. For example, the single Alert below would be hashed together with the startTimeMillis from the root file (e.g. do a SHA1 on "alertjson:startTimeMillis") to form a unique hash of this alert.
code/sql/schema.sql
"date_updated" TIMESTAMP, | ||
"file_name" TEXT NOT NULL, | ||
"json_hash" uuid NOT NULL, | ||
UNIQUE ("start_time_millis", "json_hash") |
My plan is to actually embed the start_time_millis into the hash, so we won't need a separate unique constraint
Yes agreed.
I think postgres expects VARCHAR(500) rather than VARCHAR[500], since the square brackets define an array type. I'm going through and adding some comments in the PR on specific lines, then can circle back here to summarize.
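For reference, a minimal illustration of the syntax difference (hypothetical table):

```sql
CREATE TABLE varchar_demo (
  name_ok  VARCHAR(500), -- a string limited to 500 characters
  name_arr VARCHAR[500]  -- an array of varchar values; Postgres ignores the 500
);
```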
@schnuerle I still can't see why that would be necessary, given that we are already guaranteeing that the data_files are unique, and all AJI records are foreign-keyed to a single data_file... Maybe I am missing something?
Looks good. As long as a JOIN will still work easily on it from the Jams table.
I think the type field should be added to the DB if it's there - it was not left out intentionally. I haven't seen it before. Maybe they updated the spec again? I'll check tomorrow.
I also just noticed there's a second coordinates list field on a jam (shown in the Waze PDF) called turnLine.
I've never seen turnLine either. Must be new. I guess add it. I'll have to check for a new spec file tomorrow in the Waze portal unless Gabriel beats me to it.
I have no information either. The latest spec is from August 2017, right? If we add it, I would just suggest that we add an additional JSONB column, just as with the jam line.
If we also want to store the individual coordinates for a turnLine in a separate table, do we add a second coordinate table for those, or add some sort of type on the existing coordinate table? We can't just put both line and turnLine in the one table as-is because we'd have no way to tell them apart.
So there is a new spec. Jason, I emailed you access. Gabriel, you can get it from the CCP partner portal. It's version 2.8.2 (the .2 is new), though that's in the file name only and not on the title page. But all the new fields we are talking about here are documented in there. Jason, just add these all to the schema for now as you see fit. I think the turn line should go into the same table, but with a 'type' column, since more may be coming.
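A rough sketch of that single-table approach (names and values are illustrative; the actual lookup table and values were added in a later commit):

```sql
-- One lookup row per kind of coordinate list, so jam lines, turn lines, and alert
-- locations can share a single coordinates table.
CREATE TABLE coordinate_types (
  id   SERIAL PRIMARY KEY,
  name TEXT NOT NULL UNIQUE
);

INSERT INTO coordinate_types (name) VALUES ('Line'), ('Turn line'), ('Location');

ALTER TABLE coordinates
  ADD COLUMN coordinate_type_id INTEGER REFERENCES coordinate_types (id);
```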
The only difference I see when diff-ing the 2.8 and 2.8.2 files is that Jam speed has been changed from:
to
which seems like kind of a big deal from an analysis standpoint, but doesn't really matter for me in loading the data. EDIT: Looks like that was probably just a copy/paste mistake they were fixing in the doc, since the data in speed does seem to be the m/s equivalent to the km/h value in speedKMH.
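If that reading is right, the two fields should differ only by the usual factor of 3.6, which is easy to sanity-check once the data is loaded (column names assumed):

```sql
-- 1 m/s = 3.6 km/h, so speed * 3.6 should roughly equal speed_kmh for every jam.
SELECT uuid,
       speed       AS speed_m_per_s,
       speed_kmh   AS speed_km_per_h,
       speed * 3.6 AS speed_converted_to_km_per_h
FROM jams;
```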
Added setup of default permissions for lambda role (if it has been created)
Added type and turnLine to jams table
Added coordinate_type lookup table (and values)
Added coordinate_type_id to coordinates table
Changed coordinates id to hash
Add Location coordinate type (for alerts)
Fixed syntax error (hanging comma)
Added schema-level permission grant for lambda role
I think we're almost there with this schema; it is just the irregularities that have me concerned now. There are 4 fields in the schema that aren't in the Waze documentation for the file and don't appear to be things we generate from other data:
I still haven't seen an actual irregularity record in a file, so I don't know if those are valid fields that could be there and the doc is just wrong, or if the doc is correct and those shouldn't show up.
I emailed you 4 examples of Waze Unusual Traffic Alert emails that I signed up for from Waze. I always thought these were irregularities but didn't double check. They give you a time, date, and location to find in the JSON files. Not sure what to do about those 4 fields you mentioned, but I wouldn't let it hold you up for now. We can modify/fix later.
I picked one (the January 17th at 7:30 am one) and grabbed all the data files from 7:10 to 7:40, but didn't find any irregularities in them. I then set up an AWS Glue crawler to catalog all the data in the bucket we'd previously created in the louisvillebeta account, and it looks like it wasn't able to find any files with irregularities either. For now I've got code in the processor that assumes the data will match their spec, and I'm just putting nulls in those extra fields in the DB schema. I think we can run with that until someone actually runs into an irregularity, and even then only if it fails to process.

If it's ok with you, I'm going to go ahead and accept this PR, then merge it over to my branch as well, and then we can look at updating the READMEs before merging my branch back into master. I can start a new PR to give us a place to review that.
@jrstill I will send you one file in Slack that contains irregularities. See if it works for you.
Thanks @GabrielBogo, that is exactly what I needed. At first glance, it looks like the fields the DB schema has but the spec didn't are in fact in the file, so I definitely need to update the processor to load them. There are also some other fields in the file that weren't in the spec or our schema. I have to spend the rest of the morning preparing for a meeting, but I'll try to take a closer look this afternoon or over the weekend.
Reviewing the irregularities file some more, here's what I've found:
Of those differences above, I think the alerts array is the main open question. I'll get working on the other parts while we figure out what to do with that alerts array.
Add new fields for irregularities
I've been doing some testing around throwing large quantities of files at the processor in an attempt to make sure users will be able to reprocess old data without issue. There were, initially, a lot of issues, most of which I've worked through. In looking at the postgres stats, I noticed a ton of wait time being spent on foreign key checks.
@jrstill, what is the % increase in throughput when dropping all FKs? I am not sure, but it seems to me that a decrease in robustness has the potential to be immensely more costly than a few extra hours (or even days) of historical data processing. But I think it should definitely be handled sooner or later.
I'm rethinking my suggestion. Maybe instead of removing the FKs, which in normal operation (just processing a single file every 2 minutes) would only save a few seconds, we instead document that if someone wants to reprocess a lot of historical files they can temporarily remove/disable foreign keys to greatly increase throughput.
I like that!
That seems like a potentially complicated thing to document and have govs execute on. I'd err on the side of ease of use. If you think it can be done easily all around, ok. But I imagine that there will be a good number of govs who are already collecting JSON files (including Louisville, and the ones who used our code, or are doing processing on their own, or are using them as part of an existing process, or get them from another source like USDOT to backload) that will want to start with backloading a bunch of data. I don't want to make it difficult for them.
Yeah, you are right, @schnuerle. No gov should have to tweak the schema to run the processor. I still worry about robustness, though. If it's only one or two extra seconds per file, it means that the total processing time would be two or three times higher when keeping the FKs, correct? I think that's a low price to pay to ensure consistency in the data. Do you think it's unfeasible to keep the FKs, @jrstill?
We could definitely provide a script that would disable the FKs, and another to re-enable them, but users would need to be prepared to deal with any issues, even if they are unlikely to happen, that might arise from inconsistent data getting loaded. I'm still leaning toward keeping the FKs but documenting that if you want to load a ton of existing files and you need it to happen faster, there is a way to do it, but it has risks. Maybe we don't even provide the script; we just describe what can be done, with the warning that the user is on their own if they decide to do it, and otherwise they can just wait for however long it takes with FKs enabled.
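If we do document it, the core of it in Postgres would look roughly like the following (one example constraint shown, with illustrative names; re-validation will fail if inconsistent data was loaded while the constraint was gone):

```sql
-- Before a large backload: drop the constraint to skip per-row FK lookups.
ALTER TABLE alerts DROP CONSTRAINT alerts_data_file_id_fkey;

-- After the backload: re-add it as NOT VALID (cheap), then validate existing rows.
ALTER TABLE alerts
  ADD CONSTRAINT alerts_data_file_id_fkey
  FOREIGN KEY (data_file_id) REFERENCES data_files (id) NOT VALID;

ALTER TABLE alerts VALIDATE CONSTRAINT alerts_data_file_id_fkey;
```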
Since it would be easier to remove the FKs later than to add them back in (since there could be inconsistent data), how about we go ahead and accept the PR as-is and just make note of the performance stuff in a new or existing issue (maybe #31)? Then I can merge the sql changes into my branch (since it requires them) and start a PR for it so we can review things before bringing it into master.
Hi. See the changes and justifications below:
1 - Create table "signals", referenced by tables "jams", "alerts" and "irregularities"
2 - Added "jams.coordinates" JSON field (which actually stores an ordered list)
3 - Changed foreign keys of table "alerts"
psql:schema.sql:41: ERROR: there is no unique constraint matching given keys for referenced table "jams"
The field "alerts.jam_id" references "jams.uuid", which is not required to be UNIQUE.
Sample code (to justify the proposed changes):