
First draft of a pain-points overview#1211

Draft
jamezpolley wants to merge 1 commit into main from painpoints

Conversation

@jamezpolley
Member

No description provided.

@jamezpolley jamezpolley self-assigned this Mar 19, 2019
@ghost ghost added the in progress label Mar 19, 2019
@rbtcollins

So I'm not sure I'd call this a points-of-pain overview so much as good docs ;P.

We've discussed a bunch offline; two thoughts that are really only relevant here: your scheduler failure modes are simple bugs, and I think they should be fixed in situ, because that can be done quickly. To wit:

- Don't de-queue and re-queue things when there is no work slot available - that's not the task failing.
- Use system metrics to inform work-slot availability (e.g. if there is I/O overload, don't schedule more work).
- Immediately place work when slots are freed up (e.g. schedule work immediately at the end of your cleanup of a work slot).
- Cap exponential backoff (e.g. at 5 minutes).
- Discard work after (say) 10 attempts.
- Finally, implement a quick-reset mechanism to zero the queue and allow an immediate restoration of service without mucking around.
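The capped-backoff and bounded-retry parts of that suggestion could be sketched roughly as follows. This is purely illustrative: the constants and method name are assumptions, not morph's actual scheduler code.

```ruby
MAX_ATTEMPTS = 10      # discard work after 10 attempts
BACKOFF_CAP  = 5 * 60  # cap exponential backoff at 5 minutes (seconds)
BASE_DELAY   = 5       # hypothetical base delay in seconds

# Returns the delay (in seconds) before the next retry,
# or nil to signal that the job should be discarded.
def retry_delay(attempt)
  return nil if attempt >= MAX_ATTEMPTS
  [BASE_DELAY * (2**attempt), BACKOFF_CAP].min
end
```

So attempt 0 waits 5s, attempt 4 waits 80s, later attempts plateau at 300s, and the job is dropped once it has burned through its 10 attempts rather than clogging the queue forever.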


## Memory contention + exponential backoff = ever-growing retry queue

Until recently, this VM only had 224Gb RAM. It would relatively often
@jamezpolley (Member, Author) commented:
s/224/24/

log-scraping job. This could probably be avoided if we could do
something like feed the logs into syslog or some other central logging
system.

@jamezpolley (Member, Author) commented:

mitmproxy troubleshooting

Because mitmproxy intercepts HTTPS traffic, we've got a self-signed CA cert on it which needs to be trusted by the scrapers. Most of the customisation on our buildstep image is trying to get this cert in all the places and trusted by all the things.

However, the dev vagrant image doesn't (unless my memory is playing tricks on me) set this up at all; which makes it hard to reproduce problems. Having a dev setup that did set up SSL by default - in as close as possible to the prod setup - would make it easier to troubleshoot problems.

Stepping back a little though: we really only have mitmproxy in place so that we can record the hosts the scraper is scraping from. If we had a different way to record those hosts, we might not need mitmproxy at all.
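To illustrate the "get the cert trusted by all the things" problem at the Ruby level: one way a scraper can trust the mitmproxy CA alongside the system CAs is to build an explicit cert store. This is a hedged sketch - the helper name and the CA file path are hypothetical, and this is not how the buildstep image actually wires it up (that works at the OS trust-store level).

```ruby
require "net/http"
require "openssl"
require "uri"

# Build a Net::HTTP client that trusts the system CAs *plus* one extra
# self-signed CA (e.g. the mitmproxy CA), given as a PEM file.
def http_with_extra_ca(uri, ca_file)
  store = OpenSSL::X509::Store.new
  store.set_default_paths   # keep the normal system trust store
  store.add_file(ca_file)   # add the proxy's self-signed CA cert

  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.cert_store = store
  http
end

# Usage (hypothetical CA path):
#   http = http_with_extra_ca(URI("https://example.com/"), "/etc/mitmproxy-ca.pem")
#   http.get("/")
```

A dev vagrant image that dropped the same CA into the same places as prod would make this kind of per-client workaround unnecessary, which is the point of the suggestion above.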

Scraper platform updates

At present all scrapers use openaustralia/buildstep:latest. This image is based on cedar:14, which is old, doesn't work at all with PHP, and is about to be EOL. We need to update to heroku:16, but this is a breaking change: a different set of system packages installed, different versions of languages available, and so on. #1207 is the start of a plan to allow scrapers to pick a platform, so that we can start with a soft migration to heroku:16 and later make other platforms available.

@Br3nda
Contributor

Br3nda commented Jul 9, 2025

@jamezpolley Shall we merge this anyway? It's a good thing to have recorded.

@ianheggie-oaf
Contributor

@jamezpolley (cc @Br3nda) - I feel this should be reviewed to make sure it's still relevant and then merged; if not, it should be closed and marked abandoned - it's getting really long in the tooth.



4 participants