DC2021 issues #72
Comments
Hmm, the problem with the surts file was likely a file permissions thing.
Having restarted with a bit more RAM and with the [...] After 18 hours, a quick performance analysis. Most threads seem to be setting up or using HTTP connections, which is good. About 80 are waiting for a lock related to queue rotation:
...where this is the lock-holder, which seems busy with cache/BDB eviction...
Oddly, there are many threads awaiting the same lock, but reporting it as owned by different threads. This is perhaps because the lock is being handed very rapidly from thread to thread while the thread stack report is being collected for printing. So, the speed of managing the Frontier queues appears to be the bottleneck, with the global lock on queue rotation somewhat amplifying this effect.
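For reference, the kind of tally described above (how many threads are queued behind each lock) can be approximated from inside any JVM with the standard `java.lang.management` API. This is only a minimal sketch: the class and method names (`LockContention`, `countWaitersByLock`) are made up for illustration, and it is not how the Heritrix thread report itself is produced.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Heritrix: counts how many live threads
// are currently blocked or waiting on each lock object.
final class LockContention {

    // Returns a map from lock name (class name + '@' + identity hash in hex,
    // as reported by ThreadInfo.getLockName()) to the number of waiting threads.
    static Map<String, Integer> countWaitersByLock() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new TreeMap<>();
        // false, false: skip the more expensive locked-monitor and
        // locked-synchronizer details; only the awaited lock is needed here.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info == null) {
                continue;
            }
            String lock = info.getLockName(); // null if the thread is not waiting on a lock
            if (lock != null) {
                counts.merge(lock, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countWaitersByLock().forEach(
            (lock, n) -> System.out.println(n + " thread(s) waiting on " + lock));
    }
}
```

`ThreadInfo.getLockOwnerName()` could be recorded alongside each count to see which thread holds a contended lock at the moment of the snapshot.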
After scaling down (from 600 to 400 ToeThreads) it seems stable. It was a bit weird for a while as I accidentally made it re-scan the full seed list, but it has settled down again now. Running okay, probably at roughly two-thirds speed! Of roughly 200-250 threads in the [...] So, making OCDX faster is something to consider! What speed disk is it on? Notes imply vanilla [...] Although the machine is heavily loaded, so maybe that's part of the reason OCDX is not able to respond more quickly?
The issues were largely resolved at this point. Notes are held elsewhere.
A number of issues with the DC2021 crawl.
Note that .uk seeds were accidentally marked as full seeds despite already being in .uk scope, and this is likely part of the problem as the whole system has to make and manage a massive augmented seed file.
But there also appear to be issues with H3 we should try to resolve.
There appear to be problems with cookie expiration: internetarchive/heritrix3#427
Then there are problems related to seed management (too many seeds)...
quite a few of these, which appear to be a problem with how ExtractorXML expects things to work - perhaps there is no content to get?
lots of these, which are harmless and long-standing, but it is irritating that dead domains are not handled more elegantly...
(registered issue about this here)
and then the big problem - a good chunk of these...
At which point, all bets are off. There's some downstream grumbling about lock timeouts, but you know, after running out of memory everything is wonky.
I think the OOM stems from the seed problem, but we may as well up the heap allocation anyway.