This repository has been archived by the owner on Jan 18, 2024. It is now read-only.

Improvements in RDS Schema #25

Merged: 17 commits, Apr 24, 2018

Conversation

@GabrielBogo (Contributor) commented Mar 27, 2018

Hi. See below changes and justifications:

1 - Create table "signals", referenced by tables "jams", "alerts" and "irregularities"

  • This will keep track of all Waze signals, regardless of whether they contain any jams, alerts or irregularities. Empty signals are also important for analyses.
  • It may be used to avoid duplicates (storing the same document twice in the database).
  • It's also a way to store the document's startDate and endDate (not pubmillis), which are absent in the current schema.
  • It's a way of including an enriched field of type TIMESTAMP WITH TIME ZONE without messing with the other table schemas. (A sketch follows.)
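For illustration, a minimal sketch of what such a table could look like, in Python with psycopg2 and illustrative column names (not the final schema):

import psycopg2  # assuming a Postgres database and the psycopg2 driver

SIGNALS_DDL = """
CREATE TABLE waze.signals (
    "id" BIGINT PRIMARY KEY,
    "start_time_millis" BIGINT NOT NULL,     -- from the JSON root
    "end_time_millis" BIGINT NOT NULL,       -- from the JSON root
    "retrieved_at" TIMESTAMP WITH TIME ZONE  -- enriched field, kept out of the other tables
);
"""

conn = psycopg2.connect("dbname=waze")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute(SIGNALS_DDL)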

2 - Added "jams.coordinates" JSON field (which actually stores an ordered list)

  • Makes it easier to calculate a jam's direction (see sample below)
  • Makes it easier to convert the jams into a GeoDataFrame (see sample below)
  • GeoDataFrames are very powerful in facilitating spatial joins

3 - Changed foreign keys of table "alerts"

  • Could not create the "alerts" table with the original code. Error:
    psql:schema.sql:41: ERROR: there is no unique constraint matching given keys for referenced table "jams"
    The field "alerts.jam_id" references "jams.uuid", which is not required to be UNIQUE.

Sample code (to justify the proposed changes):

import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString

def get_direction(coord_list):
    """Derive cardinal directions from a jam's ordered coordinate list."""
    try:
        num_coords = len(coord_list)
    except TypeError:  # missing/NaN coordinates
        return pd.Series([None, None, None])

    # North/South
    y_start = coord_list[0]["y"]
    y_end = coord_list[num_coords - 1]["y"]
    delta_y = y_end - y_start
    lat_direction = "North" if delta_y >= 0 else "South"

    # East/West
    x_start = coord_list[0]["x"]
    x_end = coord_list[num_coords - 1]["x"]
    delta_x = x_end - x_start
    lon_direction = "East" if delta_x >= 0 else "West"

    # Major direction
    major_direction = "North/South" if abs(delta_y) > abs(delta_x) else "East/West"

    return pd.Series([lon_direction, lat_direction, major_direction])

df_jams = pd.read_sql(...)  # INSERT HERE YOUR QUERY (and connection)
df_jams['jams_line_list'] = df_jams['coordinates'].apply(
    lambda coords: [(d['x'], d['y']) for d in coords])
df_jams['jam_LineString'] = df_jams['jams_line_list'].apply(LineString)
df_jams[["LonDirection", "LatDirection", "MajorDirection"]] = \
    df_jams["coordinates"].apply(get_direction)
crs = {'init': 'epsg:4326'}  # WGS84 lon/lat
geo_jams = gpd.GeoDataFrame(df_jams, crs=crs, geometry="jam_LineString")

@schnuerle (Contributor) commented:

Thanks Gabriel!

1 - What's a "signal"? What exactly do you mean by that term, as far as the definition of the word? Like Waze is 'signalling' an event? I'm simply trying to understand the word choice for this new table.

This makes good sense for the dates and enrichment.

There is one change in the schema I'm not sure about:

  • "jam_id" VARCHAR[500] REFERENCES waze.jams (uuid),
  • "irregularity_id" VARCHAR[500] REFERENCES waze.irregularities (uuid)
  • "signal_id" BIGINT NOT NULL REFERENCES waze.signals (id)

Removing the jam_id and irregularity_id from the Alert table may not be good - they make it easier to connect the alerts to the other tables. Waze has them in its schema, and it seems like we may want to keep them there? Maybe leave all 3?

I'm also not sure whether the Waze structure will actually remove duplicates in the Signal table. Does the data line up enough around alerts, jams, irregularities (a/j/i)? Do you have to validate every new a/j/i insert against the signal table to see if it's already there?

I'd like to loop in @jrstill on this and we can work out the details. @hmdavis may have some feedback too.

@GabrielBogo (Contributor, Author) commented Mar 28, 2018

Hi, Michael. Thanks for the prompt response!

1 - With respect to "Signal", I refer to the JSON file, stored in S3, that Waze updates every 2 minutes, containing all jams, alerts and irregularities. I couldn't find a better word; maybe you guys could help?
2 - Regarding "jam_id" and "irregularity_id" in the "alerts" table, I agree with you. For backwards compatibility, it would be better to keep all three there. I only changed them because I hit an error on those fields when I tried to create the schema locally with the original schema.sql. Am I doing something wrong?
3 - I use "duplicate" to refer to a "Signal" that has been stored twice in S3. For instance, let's say we set CloudWatch to fetch data every minute: wouldn't we end up with repeated files (i.e. with the exact same j/a/i)? I always assumed that was the case, but now that you've raised the question I'm not sure anymore.

But to conclude my rationale: if that were the case, a UNIQUE constraint on the signals.startTimeMillis field would avoid that kind of duplicate.

I already saw a few things to change in this PR, but I will wait for more comments from the team.

Please let me know what you guys think.

@jrstill (Contributor) commented Apr 2, 2018

If the Waze data itself references the uuid fields of jams/irregularities in the alerts data, it seems like we can reasonably assume those are unique (the name would imply that as well). In that case, could we drop the id primary keys, make the uuid fields on each table the primary keys, and then always do UPSERTs to ensure we never get duplicate data into the tables?
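For reference, a Postgres-style UPSERT keyed on uuid could look like the sketch below (psycopg2 assumed, column list abbreviated; this presumes the uuids really are unique, which is questioned just below):

UPSERT_JAM = """
INSERT INTO waze.jams (uuid, speed, pub_millis)
VALUES (%s, %s, %s)
ON CONFLICT (uuid) DO UPDATE          -- requires Postgres 9.5+
    SET speed = EXCLUDED.speed, pub_millis = EXCLUDED.pub_millis;
"""
# conn: an existing psycopg2 connection; jam: one parsed jam record
with conn, conn.cursor() as cur:
    cur.execute(UPSERT_JAM, (jam["uuid"], jam["speed"], jam["pubMillis"]))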

@GabrielBogo (Contributor, Author) commented Apr 3, 2018

Hi, @jrstill.

It seems that those uuids are not unique. Check your Slack - I will send you two Waze JSONs from our polygon, captured 5 minutes apart. The jam uuid 6143090 is present in both, and with different speeds!

Also, it sounds a little risky to me to base an UPSERT operation on a field that is controlled by a third party. If Waze ever makes a mistake or ships a bug, we may have data erased from our database.

I think @schnuerle's point of view of avoiding constraints might hold in this case. Maybe we keep the fields ("alerts.jam_id" and "alerts.irregularity_id"), but without the FK constraints.

Another option is to create a composite UNIQUE constraint on (jam_date, uuid). That is what I have done here.
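For concreteness, the composite constraint described above would look something like this (a sketch; jam_date is assumed to be the enriched date column in his schema):

ALTER_JAMS = """
ALTER TABLE waze.jams
    ADD CONSTRAINT jams_jam_date_uuid_key UNIQUE (jam_date, uuid);
"""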

Let me know what you guys think and, if you find it appropriate, I'll push another commit to this PR.

@schnuerle (Contributor) commented Apr 3, 2018

@GabrielBogo - I talked with @jrstill today for a while and I know you all talked earlier. Here's what we came up with for you to review.

Note: AJI is my shorthand for Alerts, Jams, Irregularities.

  1. First, FYI, the jam uuid is not unique across files, just unique to a jam. And from one minute to the next the only thing that may change is the speed, or nothing at all, and that's ok. But you still want a row to record that. Each minute is like a heartbeat of that jam, and sometimes it's the same heartbeat, but you still want a record of it to paint a complete picture over time.

  2. We think that foreign keys may not be a good idea in this project. First, the FK is not always there in the data feed. Second, it limits our ability to do parallel processing on the data feed. Third, the Waze data feed is 'unreliable' in that they can (and do) change it. Having FKs is a limiting factor in ingesting the data, which can cause data loss (and has for us).

  3. Primary keys on AJI are a good idea. We think this should be the startTimeMillis at the JSON root plus a hash of the single AJI's JSON, e.g. 1522771560000:f186303fe03418924a556bf766cfced5 (I just used md5, but it could be something else).

This will still enter a row for each AJI every 2 minutes (see point 1 above). It's useful, though, if we are back-loading missing data from JSON files: the processor then knows not to add a new row for duplicate data files. Matches will still update the row, which is good for backfilling new columns of data Waze may add.

  4. For your signal table, it's really like pinging the data file so you know you got it, with some data file metadata. So I propose calling it DataFile (that's what we called this table in our internal database). It would have a generated unique ID, startTimeMillis and endTimeMillis from the JSON root, and the file name (e.g. wazedata_2017_11_01_20_13_24_263.json). It would also have UTC DateCreated and DateUpdated fields. All of this is very useful for error checking, finding gaps in data, and optimizing the system. Maybe a source field too. (See the sketch after this list.)

  5. We think, though, that there should be a second table called ReportPeriod. This would be used by the AJI tables, and would have ReportPeriodStart and ReportPeriodEnd, which are UTC versions of startTimeMillis and endTimeMillis from the JSON root, plus a generated unique ID that would be referenced by the AJI tables as ReportPeriodID.

I know this is similar to the DataFile table, but I think its purpose is distinct. DataFile is more about finding processing issues; ReportPeriod is more about the time of the AJI report. Later, we can enrich it with week-of-year number, quarter number, and time blocks (e.g. a reference to a future table that defines traffic-engineering blocks of hours for reporting, such as Morning Peak or Weekend Night).

  6. Local time zone is an issue, FYI. We aren't going to put it in the DB; for now it will be handled with query and viz logic instead. This is because a) with Daylight Saving these things change, b) the Waze data doesn't specify the time zone anywhere, and c) a data feed can cross multiple time zones, and the only way to know which is where is a geospatial query against time zone shapes and rules. Maybe later a user can pick one time zone upon install and it could be added to the ReportPeriod table.

  7. For the pubmillis (start of the event) field in AJI, we'd like to add a UTC field to each table. I think that won't cause duplicates since this goes down to the microsecond.

  8. We'd like to make a new, third table called RawRecord, which contains unique AJI JSON data, referenced from the AJI tables by a new field. We'd check for duplicates by hashing the JSON to keep the size of this table down, and splitting it out helps with indexing speed in some DBs.
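A rough sketch of the DataFile table proposed in point 4 (column names follow the snake_case quoted later in this PR; the exact columns were still under discussion):

DATA_FILE_DDL = """
CREATE TABLE waze.data_files (
    "id" SERIAL PRIMARY KEY,
    "start_time_millis" BIGINT NOT NULL,  -- startTimeMillis from the JSON root
    "end_time_millis" BIGINT NOT NULL,    -- endTimeMillis from the JSON root
    "file_name" TEXT NOT NULL,            -- e.g. wazedata_2017_11_01_20_13_24_263.json
    "date_created" TIMESTAMP,             -- UTC
    "date_updated" TIMESTAMP              -- UTC
);
"""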

So, lots of good DB schema changes here! I'm not sure whether, once this is agreed upon, I should take your PR and modify it or do a new PR.

@GabrielBogo (Contributor, Author) commented Apr 3, 2018

Hi, @schnuerle. Your proposals seem plausible and well considered. See below my comments:

  1. Clear.

  2. Are you referring to FKs in general or specifically to the ones in the AJI tables? If it's the latter, I agree. If it's the former, I still believe some FKs may provide robustness with little loss of flexibility. In any case, we'd better discuss that with the schema in front of us.

  3. Great idea. Just to double-check, do hashes take into account the unordered nature of JSON?

  4. Agreed.

  5. I understand the reasons and it makes sense. I would only question whether the benefits justify the increased complexity in the schema and the subsequent code. The simpler the code is, the easier it will be for other municipalities to build upon it. I have a feeling that a single DataFile table would suffice. Again, we'd better discuss that over a written schema.

  6. Agreed. It's easy to deal with timezones with code.

  7. I don't think I understood this one. You propose adding an additional UTC field in AJI tables, which would be a different representation of the information in pubmillis? Why?

  8. Same as 5: I understand the reasons and it makes sense. But that could be included in DataFile as well. Performance would not be too big an issue, since these tables wouldn't be read that often.

  9. Upon agreement, I'm indifferent regarding keeping this PR or creating a new one.

Extra: Are you guys in favor of adding the JSON column for the jam's coordinates, as initially proposed in this PR?

Cheers,

@schnuerle (Contributor) commented:

  1. Ok.

  2. The latter, just the AJI tables. Some FKs may be alright, though I'm not sure which yet.

  3. The idea was to take the JSON as-is from Waze and hash that. If they change the order of things, it would be a different hash. We could do a sort on the JSON attribute names and/or the values of coordinates first.

  4. Ok.

  5. Agreed about complexity. It could be in the DataFile table. We went back and forth on this.

  6. Yes we'd like to enrich the AJI tables with UTC date/time fields. Querying would be easier - we currently convert to UTC on the fly in our queries and it slows things down.

  7. Ok.

  8. Ah, the idea here is just to keep one record's worth of JSON in a field in the RawRecord table. You are right that if we store the entire JSON file, it would go in the DataFile table - I hadn't considered that. Is this valuable to you?

  9. Ok.

  10. (Extra) Yes we were thinking of 8) solving this. I might have misunderstood. If you only want the coordinates in a field in the AJI tables (or maybe another field in the RawRecord table) that sounds good.

@GabrielBogo (Contributor, Author) commented:

  1. Ok
  2. Ok. I think this will be very important for dealing with the very likely possibility of making two requests to the API within the same minute. That would result in duplicate JSONs, but with different names and orders. We will have to deal with this issue when we get to the point of merging our data into your architecture.
  3. Ok
  4. Ok.
  5. Yes, it would be valuable. Just to clarify, it would be two columns in the DataFile table, then? Let's say: DataFile.raw_json, DataFile.json_hash?
  6. Yes, I was thinking of a field in the AJI tables. Ok, then.

I think we are almost there. Upon agreement, I volunteer to start implementing these changes, but I don't know about your project's roadmap, team and deadlines. Please let me know what you guys think.

Cheers,

@jrstill (Contributor) commented Apr 4, 2018

It sounds like the PK issue (or, more generally, how do we recognize a unique AJI record) is a bit of a sticking point still, so I'll throw in my understanding of it.

It appears that, even if you request the data feed every 10 seconds, the start and end times at the root level are in 1-minute increments aligned exactly with the minute. So if you request it any time between 1:01:00 and 1:01:59, the start will be 1:01:00. However, the actual AJI records themselves may have different data within that time frame. To handle this, we came up with using the start time from the root level along with a hash of the AJI record itself to identify unique records. This will allow us to reprocess the same file as many times as we want, even if the file name gets changed, without duplicating data. This is a bit of a conglomeration of how AWS does ids (ARNs) and how GitHub itself identifies commits (recursive SHA1 hashes over the whole tree + metadata). In this way, even if the data retriever somehow messed up and queued 5 copies of every file for processing, we wouldn't get duplicate records. The only way a new record goes in is if the actual AJI JSON is different OR the start time at the root level is different.

As far as how to ensure that the JSON hashing is idempotent, I can think of a few different ways we might go about it. I'll test some ideas and figure out which work and are fast enough, but overall I think that's definitely a solvable problem, so I wouldn't worry much about it.
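One possible way to make the hashing idempotent (a sketch, not necessarily what the processor ended up using): serialize each record with sorted keys before hashing, and embed the root start time in the hashed string.

import hashlib
import json

def record_hash(start_time_millis, record):
    # json.dumps with sort_keys=True yields the same string no matter what
    # key order Waze happens to emit, so reprocessing the same file (even
    # renamed) produces the same hash. Array order is preserved, which is
    # what we want for ordered coordinate lists.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    payload = "{}:{}".format(start_time_millis, canonical)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

# e.g. record_hash(1522771560000, alert_dict) -> a 40-char hex digest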

One other thing I'll chime in on is storing the whole file JSON in a table in the DB. I'd recommend against that, though if it is what everyone wants I'll certainly do it. The files are already going to be stored forever in significantly cheaper S3 storage, plus putting them in the DB actually increases the size (because they affect indexes, there's metadata around their records, etc). If having the coordinates in JSON form stored in a field in the table makes certain use cases a lot easier, I'm all for that, but I'd be hesitant to duplicate the whole file in a field.

@GabrielBogo (Contributor, Author) commented Apr 4, 2018

Hi, @jrstill, thank you for the clarifications.

  • Clear regarding the uniqueness of the data. No further comments from my side.
  • OK regarding idempotent hashing. I'm curious about the solution.
  • I agree with you. For me, coordinates in JSON are definitely enough for usage. The whole file's hash is enough for guaranteeing uniqueness. File name is enough to reference the raw JSON in S3. I don't see much use in storing the entire JSON either.

jrstill pushed a commit that referenced this pull request Apr 5, 2018
Switched to async/await (new node version support in lambda)
Stubbed individual list (A/J/I) processor functions
Added hashing function (see PR #25 discussion)
@schnuerle (Contributor) commented:

Ok, great. Then I think we agree to store only the JSON coordinates for a specific AJI, in a new jsonb field in each AJI table (in addition to the Coordinates table). The JSON for a specific AJI and the JSON for the entire file do not need to be stored in the DB.
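As a sketch, the agreed change would be an ALTER like the following on each AJI table (column name illustrative):

with conn, conn.cursor() as cur:  # conn: an existing psycopg2 connection
    cur.execute('ALTER TABLE waze.jams ADD COLUMN "line" JSONB;')  # raw coordinates list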

Next step is to rework the schema with all of these new changes. I can finish it early next week or if someone else gets to it first that’s great.

@GabrielBogo (Contributor, Author) commented:

Hi, @schnuerle. I will work on it today and see how far I can get. I'll get back to you on Monday with a status update on my progress.

Cheers,

@GabrielBogo (Contributor, Author) commented Apr 9, 2018

Hi, @schnuerle and @jrstill

I just pushed a new version of the schema. I won't go over all the changes because they can be easily seen through the code diff. A few comments, though:

1 - The "VARCHAR[500]" datatype did not work here in my local database, so I changed it to TEXT. Would that be a problem?

2 - I changed the ids from BIGINT to SERIAL to allow for auto-increment

3 - Irregularities already contain the same date in two formats: milliseconds and string. I chose to keep both as-is and add a new UTC one. Would that be too much?

4 - Is there ever an "irregularity_id" in an Alert? I didn't find it here in my DataFiles nor in Waze's spec.

5 - For Alerts, why aren't we storing the "Confidence" field?

6 - I haven't tested the "coordinates", "roads" and "alert_types" tables. I'll proceed to those upon code review and agreement on the changes made so far.

Below is the code (Python) that I used to test the schema. It takes a DataFile, explodes the data, and stores it in the AJI tables. The JSON ordered hash has worked properly so far.

https://github.com/GabrielBogo/Joinville-Smart-Mobility/blob/master/src/data/store_data_file.py

@schnuerle (Contributor) commented:

Thanks for working on it!

1- I think that's fine but it depends on the database we use. We were going to go with Aurora Postgres, so that's why it was varchar. I'll let @jrstill give his thoughts.

2- that's good

3- that makes sense

4- Maybe not, I'll need to double check where that came from.

5- We should be. I added it.

6- ok

7- It looks like you put the ReportPeriod fields we discussed into the data_file table. I'm ok with that.

8- I think we still need a RawRecord table. This is hashed JSON for each AJI record. This would be used to check for and prevent 100% duplicate records (including timestamp) when we are trying to run scripts to replace missing data with JSON files. @jrstill is going to check and see if this could be added as a field in each AJI table, instead of its own table.

@GabrielBogo (Contributor, Author) commented Apr 10, 2018

Good. Let's wait for @jrstill 's input, then.

One question, @schnuerle, regarding the RawRecord table: wouldn't the field "waze.data_files.json_hash" be enough to guarantee uniqueness, given that it is the hash of the ordered contents of the entire data file (AJI + timestamp)?

"date_created" TIMESTAMP,
"date_updated" TIMESTAMP,
"file_name" TEXT NOT NULL,
"json_hash" uuid NOT NULL,
Contributor comment on the lines above: The hash we create will be SHA1, so this probably needs to be a varchar(40).

@schnuerle (Contributor) commented Apr 10, 2018

Well, in this case the hash covers a subset of the whole file: each single A, J, or I record. For example, the single Alert below would be hashed together with the startTimeMillis from the root of the file (e.g. do a SHA1 on "alertjson:startTimeMillis") to get a unique hash of this alert.

      "country": "US",
      "nThumbsUp": 0,
      "city": "Louisville, KY",
      "reportRating": 0,
      "confidence": 0,
      "reliability": 7,
      "type": "WEATHERHAZARD",
      "uuid": "70797529-beef-357f-bbd7-132607c054ed",
      "roadType": 1,
      "magvar": 353,
      "subtype": "HAZARD_ON_ROAD_CONSTRUCTION",
      "street": "Lamborne Blvd",
      "reportDescription": "Lane and Sidewalk Closure",
      "location": {
        "x": -85.81394933128712,
        "y": 38.11600076069304
      },
      "pubMillis": 1518426000000

"date_updated" TIMESTAMP,
"file_name" TEXT NOT NULL,
"json_hash" uuid NOT NULL,
UNIQUE ("start_time_millis", "json_hash")
Contributor comment on the lines above: My plan is to actually embed the start_time_millis into the hash, so we won't need a separate unique constraint.

Reply (Contributor): Yes, agreed.

@jrstill (Contributor) commented Apr 10, 2018

I think Postgres expects varchar(500) instead of varchar[500], so that's probably what was going on there. Aurora (via pgAdmin 4) will accept the [] version, but it doesn't actually go in as a varchar(500); I have to use the () version instead. Ultimately it probably doesn't matter, as I think Postgres now stores varchar and text essentially the same way (or close enough that it doesn't matter in most cases).
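A quick way to see the difference (in Postgres, the bracket form declares an array of varchar, and the bound inside the brackets is ignored):

with conn, conn.cursor() as cur:  # conn: an existing psycopg2 connection
    cur.execute("CREATE TEMP TABLE vc_demo (a VARCHAR[500], b VARCHAR(500));")
    # In psql, \d vc_demo then shows:
    #   a | character varying[]     <- an array of varchar; the 500 was ignored
    #   b | character varying(500)  <- an actual length-limited varchar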

I'm going through and adding some comments on specific lines in the PR, then I'll circle back here to summarize.

@GabrielBogo (Contributor, Author) commented Apr 10, 2018

@schnuerle I still can't see why that would be necessary, given that we are already guaranteeing that the data_files are unique, and all AJI records have a foreign key to a single data_file... Maybe I am missing something?

@schnuerle (Contributor) commented:

Looks good. As long as a JOIN will still work easily on it from the Jams table.

@schnuerle (Contributor) commented:

I think the type field should be added to the DB if it's there - it was not left out intentionally. I haven't seen it before. Maybe they updated the spec again? I'll check tomorrow.

@jrstill (Contributor) commented Apr 17, 2018

I also just noticed there's a second coordinates-list field on a jam (shown in the Waze PDF) called turnLine that isn't in the schema, but I also haven't seen it in any files yet. Any idea whether it is a legitimate field, and if so, how does it figure into how we're storing coordinates?

@schnuerle (Contributor) commented:

I've never seen turnLine either. Must be new. I guess add it. I'll have to check for a new spec file tomorrow in the Waze portal, unless Gabriel beats me to it.

@GabrielBogo (Contributor, Author) commented:

I have no information either. The latest spec is from August 2017, right?

If we add it, I would just suggest that we add an additional JSONB column, just as with the jam line.

@jrstill (Contributor) commented Apr 18, 2018

If we also want to store the individual coordinates of a turnLine in a separate table, do we add a second coordinates table for those, or add some sort of type on the existing coordinates table? We can't just put both line and turnLine in the one table as-is, because we'd have no way to tell them apart.

@schnuerle (Contributor) commented:

So there is a new spec. Jason, I emailed you access. Gabriel, you can get it from the CCP partner portal.

It's version 2.8.2 (the .2 is new), though that's only in the file name and not on the title page. But all the new fields we are talking about here are documented in there.

Jason, just add these all to the schema for now as you see fit. I think the turn line should go into the same table, but with a 'type' column, since more may be coming.
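A sketch of the type-column approach (names are illustrative; they line up with the commits that follow):

COORDINATE_TYPE_DDL = """
CREATE TABLE waze.coordinate_type (
    "id" INTEGER PRIMARY KEY,
    "type_name" TEXT NOT NULL   -- e.g. 'Line', 'Turn Line', 'Location'
);
ALTER TABLE waze.coordinates
    ADD COLUMN "coordinate_type_id" INTEGER REFERENCES waze.coordinate_type (id);
"""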

@jrstill (Contributor) commented Apr 18, 2018

The only difference I see when diff-ing the 2.8 and 2.8.2 files is that Jam speed has been changed from:

Current average speed on jammed segments in Km/h

to

Current average speed on jammed segments in meters/seconds

which seems like kind of a big deal from an analysis standpoint, but doesn't really matter for me in loading the data.

EDIT: Looks like that was probably just a copy/paste mistake they were fixing in the doc, since the data in speed does seem to be the m/s equivalent of the km/h value in speedKMH.
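That reading is easy to sanity-check on any jam record, since 1 km/h is 1/3.6 m/s:

# 'jam' is a hypothetical parsed jam record containing both speed fields
assert abs(jam["speed"] - jam["speedKMH"] / 3.6) < 0.1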

Commits pushed to this branch:

  • Added setup of default permissions for lambda role (if it has been created)
  • Added type and turnLine to jams table
  • Added coordinate_type lookup table (and values)
  • Added coordinate_type_id to coordinates table
  • Changed coordinates id to hash
  • Add Location coordinate type (for alerts)
  • Fixed syntax error (hanging comma)
  • Added schema-level permission grant for lambda role
@jrstill (Contributor) commented Apr 20, 2018

I think we're almost there with this schema; it's just the irregularities that have me concerned now. There are 4 fields in the schema that aren't in the Waze documentation for the file and don't appear to be things we generate from other data:

  1. n_thumbs_up
  2. n_comments
  3. n_images
  4. is_highway

I still haven't seen an actual irregularity record in a file, so I don't know if those are valid fields that could be there and the doc is just wrong, or if the doc is correct and those shouldn't show up.

@schnuerle (Contributor) commented:

I emailed you 4 examples of Waze Unusual Traffic Alert emails that I signed up for from Waze. I always thought these were irregularities but didn't double-check. They give you a time, date, and location to find in the JSON files.

Not sure what to do about those 4 fields you mentioned, but I wouldn’t let it hold you up for now. We can modify/fix later.

@schnuerle added the "Phase 1 RDS - End to end data processor with hooks and alarms" label on Apr 20, 2018
@jrstill (Contributor) commented Apr 20, 2018

I picked one (the January 17th at 7:30 am one) and grabbed all the data files from 7:10 to 7:40, but didn't find any irregularities in them. I then set up an AWS Glue crawler to catalog all the data in the bucket we'd previously created in the louisvillebeta account, and it looks like it wasn't able to find any files with irregularities either. For now I've got code in the processor that assumes the data will match the spec, and I'm just putting nulls in those extra fields in the DB schema. I think we can run with that until someone actually runs into an irregularity, and even then only if it fails to process.

If it's ok with you, I'm going to go ahead and accept this PR, then merge it over to my branch as well, and then we can look at updating the READMEs before merging my branch back into master. I can start a new PR to give us a place to review that.

@GabrielBogo (Contributor, Author) commented:

@jrstill I will send you one file in Slack that contains irregularities. See if it works for you.

@jrstill (Contributor) commented Apr 20, 2018

Thanks @GabrielBogo, that is exactly what I needed. At first glance, it looks like the fields the DB schema has but the spec didn't are in fact in the file, so I definitely need to update the processor to load them. There are also some other fields in the file that weren't in the spec or our schema. I have to spend the rest of the morning preparing for a meeting, but I'll try to take a closer look this afternoon or over the weekend.

@jrstill (Contributor) commented Apr 23, 2018

Reviewing the irregularities file some more, here's what I've found:

  • causeType: text - in the file @GabrielBogo sent, one of the 2 irregularities had this; this field isn't mentioned in the spec
  • endNode: text - both records in the file had this, though it isn't in the spec; I noticed that the jams table has both startNode and endNode, but not all jam records have both values, so I'm thinking it would be smart to go ahead and add startNode to irregularities as well
  • is_highway/highway: bool - the schema currently has is_highway, but since it wasn't in the spec I was ignoring it; the files have just highway; I think we can leave the schema as-is and just map them in code
  • alerts: array - the irregularities in the example file contain an alerts array, and one of the 2 actually had an alert record in it; the same alert was also in the file's main alerts list, and as far as I can tell the one on the irregularity record is an exact copy

Of those differences above, I think the alerts list on an irregularity is the only one I really need clarification/guidance on. What do we want to do with that data? Add an irregularity_alert bridge table with irregularity uuid and alert uuid? Just store the raw JSON? Something else?

I'll get working on the other parts while we figure out what to do with that alerts array.

Commit pushed: Add new fields for irregularities
@jrstill (Contributor) commented Apr 23, 2018

I've been doing some testing around throwing large quantities of files at the processor, in an attempt to make sure users will be able to reprocess old data without issue. There were, initially, a lot of issues, most of which I've worked through. In looking at the Postgres stats, I noticed a ton of wait time being spent on LWLock:buffer_content, and Amazon has some info about this specifically (bottom of the page). As a test, I dropped all the FKs and there was definitely a noticeable increase in throughput. I'm not a huge fan of having no FKs, because I really like DB consistency, but I wonder if it might make sense to get rid of them for this specific use case. Even for single-file processing, the difference is noticeable.

@schnuerle (Contributor) commented:

I'm ok with dropping them for now. Maybe they can be clearly documented in a new Issue so we could get back to them later. It could be something we revisit when we look at #32 (and #30 and #31).

@GabrielBogo (Contributor, Author) commented Apr 23, 2018

@jrstill, what is the % increase in throughput when dropping all FKs? I am not sure, but it seems to me that a decrease in robustness has the potential to be immensely more costly than a few extra hours (or even days) of historical data processing.
If it's too much, though, I would agree with @schnuerle in leaving this issue to be handled later.

But I think it should be definitely handled sooner or later.

@jrstill (Contributor) commented Apr 23, 2018

I'm rethinking my suggestion. Maybe instead of removing the FKs, which in normal operation (just processing a single file every 2 minutes) will only be a difference of seconds, we instead document that if someone wants to reprocess a lot of historical files they can temporarily remove/disable foreign keys to greatly increase throughput.

@GabrielBogo (Contributor, Author) commented:

I like that!

@schnuerle (Contributor) commented:

That seems like a potentially complicated thing to document and have govs execute. I'd err on the side of ease of use. If you think it can be done easily all around, ok. But I imagine there will be a good number of govs who are already collecting JSON files (including Louisville, and the ones who used our code, or are doing processing on their own, or are using them as part of an existing process, or get them from another source like USDOT to backload) that will want to start with back-loading a bunch of data. I don't want to make it difficult for them.

@GabrielBogo (Contributor, Author) commented Apr 23, 2018

Yeah, you are right, @schnuerle. No gov should have to tweak the schema to run the processor.

I still worry about robustness, though. If it's only one or two extra seconds per file, it means the total processing time would be two or three times higher when keeping the FKs, correct? I think that's a low price to pay to ensure consistency in the data.

Do you think it's unfeasible to keep the FKs, @jrstill?

@jrstill (Contributor) commented Apr 23, 2018

We could definitely provide a script to disable the FKs, and another to re-enable them, but users would need to be prepared to deal with any issues, however unlikely, that might arise from inconsistent data getting loaded. I'm still leaning toward keeping the FKs but documenting that, if you want to load a ton of existing files and need it to happen faster, there is a way to do it, with risks. Maybe we don't even provide the script; we just describe what can be done, with the warning that the user is on their own if they decide to do it, and otherwise they can just wait for however long it takes with FKs enabled.
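For what it's worth, Postgres would also let such a script re-add the constraints cheaply after a bulk load, deferring the full check (a sketch with hypothetical constraint and column names):

# Before the bulk load: drop the constraint
DROP_FK = "ALTER TABLE waze.jams DROP CONSTRAINT IF EXISTS jams_datafile_fk;"

# After the load: re-add without scanning existing rows (NOT VALID), ...
READD_FK = """
ALTER TABLE waze.jams
    ADD CONSTRAINT jams_datafile_fk
    FOREIGN KEY (datafile_id) REFERENCES waze.data_files (id) NOT VALID;
"""
# ... then validate the backfilled rows when convenient
VALIDATE_FK = "ALTER TABLE waze.jams VALIDATE CONSTRAINT jams_datafile_fk;"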

@jrstill (Contributor) commented Apr 24, 2018

Since it would be easier to remove the FKs later than to add them back in (since there could be inconsistent data by then), how about we go ahead and accept the PR as-is, and just make note of the performance findings in a new or existing issue (maybe #31)? Then I can merge the SQL changes into my branch (since it requires them) and start a PR for it, so we can review things before bringing it into master.

@jrstill merged commit da74ae8 into LouisvilleMetro:master on Apr 24, 2018