Ideas for improving import from YNR #300

Open
symroe opened this issue Oct 23, 2018 · 4 comments
Comments

@symroe
Member

symroe commented Oct 23, 2018

Over in #297 Chris made some useful comments. Adding them here so they don't get lost.


I've left a few comments inline, but while I've been reading this I've also been thinking about other places to go with this. If we still have problems, I reckon there's way more mileage in improving or optimising this process. I'm not saying these are things we need to look at right now, but here are a few ideas; I mainly want to write them down while I think of them:

  1. One of the issues we've got at the moment is that we've been running a full sync from YNR more frequently than YNR regenerates the cached API responses, which means every so often we regress to old data and then have to do a massive incremental sync just after each full sync. This also gets worse throughout the day. It's a fiddly thing to manage because we've got a dependency on 2 cron jobs running on different servers. Depending on what time of year it is, we might want to change those frequencies, and keeping those 2 values in step is difficult to regulate. I wonder if there's a smarter way: have the full sync run more frequently, but as a first step inspect YNR to see if the cached-api response has changed since it last imported from it (and if it hasn't, quit early). That way, if you change how often it is generated, WhoCIVF can respond to that instead of us having to remember to update it in 2 places, and we don't end up in this condition. (A sketch of this check follows the list.)

  2. This process seems like it would be a good candidate for looking at whether we can improve performance by doing some of this asynchronously.

    If you think about add_people(), we want to do a load of insert ops. We don't really care about the result returned by any of those DB queries, but it's speed-critical and we've got an I/O blocking operation in a loop. At the moment we send a DB query off, wait for it to complete so we can discard its result, then fire off the next DB query, wait for that one to complete and discard its result, and so on. Only once all that DB I/O is complete can we start processing the next file.

    If we were able to fire them off asynchronously, I wonder if we'd bash through this quicker: send them all off, let postgres deal with them, and immediately crack on with the next file? (See the async sketch after this list.)

  3. At the moment we treat this as a process that runs, stops, and then runs again, which we schedule with cron to run every minute. To avoid problems, we want to keep its run time under 1 minute in all circumstances. Ultimately, if we make a sufficient number of edits on YNR, there's always scope for this to take too long, overlap itself and make a horrible mess. Another (more correct but also trickier) way to look at this would be as a daemon that permanently runs in the background from boot and, each time it runs, schedules itself to run again in N minutes. That would ensure it doesn't overlap itself. I wonder if there is a good way to daemonise a Django management command? (There's a rough sketch of one approach after this list.)
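
On idea 1, here's a minimal sketch of the "check before full sync" step, assuming YNR serves a Last-Modified header on the cached-api response. The URL and state-file path are illustrative, not the real configuration:

```python
import sys

import requests

# Both of these are hypothetical values for illustration, not real config.
CACHED_API_URL = "https://candidates.democracyclub.org.uk/cached-api/latest/"
LAST_IMPORT_FILE = "/var/tmp/ynr_cached_api_last_modified"


def cached_api_changed():
    """Compare YNR's Last-Modified header with the value seen on the last run."""
    resp = requests.head(CACHED_API_URL, timeout=10)
    resp.raise_for_status()
    current = resp.headers.get("Last-Modified", "")

    try:
        with open(LAST_IMPORT_FILE) as f:
            previous = f.read().strip()
    except FileNotFoundError:
        previous = ""

    if current and current == previous:
        return False  # nothing regenerated since the last full sync

    with open(LAST_IMPORT_FILE, "w") as f:
        f.write(current)
    return True


if __name__ == "__main__":
    if not cached_api_changed():
        sys.exit(0)  # quit early, as suggested above
    # ... run the full sync ...
```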
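On idea 2, a sketch of what an async add_people() might look like, assuming we talk to postgres directly with asyncpg rather than through the (synchronous) Django ORM, and with placeholder table/columns rather than the real schema:

```python
import asyncio

import asyncpg


async def add_people(people, dsn="postgresql://localhost/whocivf"):
    """Fire all the INSERTs off concurrently and discard the results,
    rather than blocking on each query one at a time."""
    pool = await asyncpg.create_pool(dsn, min_size=5, max_size=10)
    try:
        await asyncio.gather(*(
            pool.execute(
                # Placeholder table/columns, not the real schema.
                "INSERT INTO person (ynr_id, name) VALUES ($1, $2)",
                person["id"],
                person["name"],
            )
            for person in people
        ))
    finally:
        await pool.close()


# asyncio.run(add_people(people_from_file))
```

The pool is what buys the concurrency here: a single connection can only run one query at a time, so gathering the inserts only helps if they're spread across pool connections.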
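On idea 3, one way to sidestep the daemonisation question: a management command that loops forever and only sleeps once each run has completed, left running under something like systemd or supervisord. A rough sketch, where run_sync() stands in for the existing import logic:

```python
import time

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run the YNR sync in a loop, never overlapping itself"

    def add_arguments(self, parser):
        parser.add_argument("--interval", type=int, default=60)

    def handle(self, *args, **options):
        while True:
            started = time.monotonic()
            self.run_sync()
            # Each run finishes fully before the next one is scheduled,
            # so runs can never overlap, however long a sync takes.
            elapsed = time.monotonic() - started
            time.sleep(max(0, options["interval"] - elapsed))

    def run_sync(self):
        # Stand-in for whatever the import currently does.
        pass
```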

@chris48s
Member

Another thought: would managing tasks with some kind of worker/queue-based solution be an improvement on scheduling with cron? (A minimal sketch of the idea follows.)
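
For example, a minimal Celery sketch of the idea (broker URL, module layout and task body are all assumptions, not a worked-out proposal):

```python
from celery import Celery

app = Celery("whocivf", broker="redis://localhost:6379/0")


@app.task
def sync_from_ynr():
    # Stand-in for the existing import command.
    ...


# Celery beat replaces the cron entry as the scheduler.
app.conf.beat_schedule = {
    "sync-from-ynr-every-minute": {
        "task": sync_from_ynr.name,  # the task's registered name
        "schedule": 60.0,
    },
}
```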

@chris48s
Member

Predictably, there are tools to help with this exact problem: lockrun and run-one.

Over the next few weeks we're into peak season for this problem. I think it's worth looking at one of these as a stop-gap.

@symroe
Member Author

symroe commented Nov 13, 2019

I've implemented lockrun on prod; of the two options above, this allowed an essentially bespoke lockfile that could be shared across different commands.

This means that I can share a lock between different commands that might interrupt each other – like importing ballots and importing recent people.

run-one simply locks per unique set of arguments passed to it.
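
lockrun itself is a small C wrapper, but the idea is simple enough to sketch in Python: every command that mustn't run concurrently takes a non-blocking flock on the same file and exits quietly if it can't get it. Lockfile path and structure here are illustrative:

```python
import fcntl
import sys

LOCKFILE = "/var/lock/ynr-import.lock"  # shared by all the related commands


def run_exclusively(func):
    """Run func() only if no other command currently holds the shared lock."""
    with open(LOCKFILE, "w") as fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            sys.exit(0)  # another import holds the lock; bail out quietly
        func()  # the lock is released when the file is closed
```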

@chris48s
Member

Nice - that's pretty handy.
I guess we'll get a good test of how well it works tomorrow :)
