Ideas for improving import from YNR #300
Another thought: would managing tasks using some kind of worker/queue-based solution be an improvement on scheduling with cron?
Predictably, there are tools to help with this exact problem:
Over the next few weeks we're into peak season for this problem. I think it's worth looking at one of these as a stop-gap.
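To make the worker/queue idea concrete, here's a minimal sketch using only the standard library. In practice you'd reach for one of the off-the-shelf tools mentioned above (a Celery/RQ-style queue with a broker); the task names here are illustrative, not the project's real commands.

```python
# A single worker pulls tasks off a queue, so jobs run strictly one
# at a time and an import can never overlap with itself -- unlike
# two cron jobs firing at once.
import queue
import threading

task_queue = queue.Queue()
results = []

def worker():
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: shut the worker down
            break
        results.append(f"ran {task}")  # stand-in for real work

thread = threading.Thread(target=worker)
thread.start()
for name in ["full_sync", "incremental_sync"]:
    task_queue.put(name)
task_queue.put(None)
thread.join()
```

A real queue also gives you retries, visibility into failed jobs, and a single place to change scheduling, which cron doesn't.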
I've implemented this. It means that I can share a lock between different commands that might interrupt each other – like importing ballots and importing recent people.
Nice - that's pretty handy.
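For illustration, one way to share a named lock between management commands is an atomically-created lock file; this is a hedged sketch, not the project's actual implementation (which may use something else, e.g. a database advisory lock). The function and lock names are assumptions.

```python
# Sketch: a named lock shared between commands. Two commands asking
# for the same lock name exclude each other; the second one fails
# fast instead of interleaving with the first.
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def command_lock(name):
    path = os.path.join(tempfile.gettempdir(), f"{name}.lock")
    try:
        # O_EXCL makes creation atomic: open fails if the file
        # already exists, i.e. another command holds the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"{name} is already running")
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)
```

Both the ballot import and the recent-people import would then wrap their work in `with command_lock("ynr_import"): ...` so they can't trample each other.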
Over in #297 Chris made some useful comments. Adding them here so they don't get lost:
I've left a few comments inline, but while I've been reading this I've also been thinking about other places to go with this. If we still have problems, I reckon there's a lot more mileage in improving or optimising this process. I'm not saying these are things we need to look at right now, but here are a few ideas. I mainly want to write them down while I think of them:
One of the issues we've got at the moment is that we've been running a full sync from YNR more frequently than YNR generates the cached API responses, which means every so often we regress to old data and then have to do a massive incremental sync just after each full sync. This also gets worse throughout the day. It's a fiddly thing to manage because we've got a dependency on 2 cron jobs running on different servers. Depending on what time of year it is, we might want to change those frequencies, and the dependency between those 2 values is a difficult thing to regulate. I wonder if there's a smarter way: have the full sync run more frequently, but as a first step inspect YNR to see if the `cached-api` response has changed since we last imported from it (and if it hasn't, `quit()`). That way, if you change how often it is generated, WhoCIVF can respond to that instead of us having to remember to update it in 2 places, and we don't end up with this condition.

This process seems like it would be a good candidate for looking at whether we can improve performance by doing some of it asynchronously.
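The "has the `cached-api` response changed?" check could be sketched as a content fingerprint compared against whatever we stored after the last import. This is illustrative only: the helper names are assumptions, and the real command would fetch the response body from YNR and persist the fingerprint somewhere (a state file or DB row).

```python
# Sketch: quit early if the cached export is identical to what we
# imported last time, so full-sync frequency no longer has to be
# coordinated by hand with YNR's generation schedule.
import hashlib

def response_fingerprint(body):
    """Stable hash of the cached-api response body."""
    return hashlib.sha256(body).hexdigest()

def should_run_full_sync(body, last_fingerprint):
    # Run only when the export actually differs from the one we
    # imported last time; otherwise the command can exit immediately.
    return response_fingerprint(body) != last_fingerprint
```

If YNR exposes an `ETag` or `Last-Modified` header on that endpoint, a conditional GET would achieve the same thing without downloading the body at all.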
If you think about `add_people()`, we want to do a load of insert ops. We don't really care about the result returned by any of those DB queries, but it's speed-critical and we've got an I/O-blocking operation in a loop. At the moment we need to send a DB query off, wait for it to complete so we can discard its result, then fire off another DB query, wait for that one to complete, discard its result, and so on; only once all that DB I/O is complete can we start processing the next file. If we were able to fire them off asynchronously, I wonder if we would bash through this quicker: send them all off, let Postgres deal with them and immediately crack on with the next file.
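The "fire them all off and move on" idea can be sketched with `asyncio`. Here `insert_person()` is a stand-in for a real async driver call (e.g. an asyncpg `execute`); the sleep simulates the DB round trip, and none of these names come from the actual codebase.

```python
# Sketch: all inserts for a file are in flight at once, and we wait
# for the whole batch rather than for each query one by one.
import asyncio
import time

inserted = []

async def insert_person(person_id):
    await asyncio.sleep(0.01)  # stand-in for the DB round trip
    inserted.append(person_id)

async def add_people(person_ids):
    # gather() runs every insert concurrently; sequentially this
    # batch would take ~50 x 0.01s, concurrently it takes ~0.01s.
    await asyncio.gather(*(insert_person(p) for p in person_ids))

start = time.monotonic()
asyncio.run(add_people(range(50)))
elapsed = time.monotonic() - start
```

A simpler middle ground worth considering first: batch the rows into a single `executemany()` or `COPY`, which gets most of the win without restructuring the command around an event loop.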
At the moment we treat this as a process that runs, stops, and then runs again, which we schedule with cron to run every minute. In order to avoid problems, we want to keep its run time under 1 minute in all circumstances. Ultimately, if we make a sufficient number of edits on YNR, there's always scope for this to take too long, overlap itself and make a horrible mess. Another (more correct but also trickier) way to look at this would be as a daemon that permanently runs in the background from boot and, each time it runs, schedules itself to run again in N minutes. That would ensure it doesn't overlap itself. I wonder if there is a good way to daemonise a Django management command?
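The core of the daemon idea is just a loop that sleeps *after* each run finishes, so however long a run takes, the next one can't start until it's done. A hedged sketch (the function names are made up; a management command's `handle()` would call something like `run_forever()`, supervised by systemd or similar so it restarts on boot and on crash):

```python
# Sketch: a self-rescheduling loop instead of cron. Because the
# sleep happens after job() returns, runs can never overlap, no
# matter how long an individual run takes.
import time

def run_forever(job, interval, max_runs=None):
    runs = 0
    while max_runs is None or runs < max_runs:
        job()                 # however long this takes...
        runs += 1
        time.sleep(interval)  # ...the next run starts `interval` later
```

`max_runs` exists only so the loop is testable; the real daemon would pass `None` and rely on the process supervisor for lifecycle management.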