-
Notifications
You must be signed in to change notification settings - Fork 46
Implement dead hosts component to skip consistently failing hosts #499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: CorentinB <5089772+CorentinB@users.noreply.github.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #499 +/- ##
==========================================
+ Coverage 56.39% 57.01% +0.61%
==========================================
Files 130 131 +1
Lines 8120 8272 +152
==========================================
+ Hits 4579 4716 +137
- Misses 3172 3184 +12
- Partials 369 372 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
cmd/get.go
Outdated
getCmd.PersistentFlags().Duration("rate-limit-cleanup-frequency", time.Duration(5*time.Minute), "How often to run cleanup of stale buckets that are not accessed in the duration.") | ||
|
||
// Dead hosts detection flags | ||
getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dead-hosts-detection seems lik a better name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the flag name to --dead-hosts-detection
as suggested. Commit d7ffd62.
// Dead hosts detection flags | ||
getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.") | ||
getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.") | ||
getCmd.PersistentFlags().Duration("dead-hosts-refresh-period", time.Duration(30*time.Minute), "How often to clean up the dead hosts cache.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be almost irrelevant if we had like a background process dedicated to checking the dead hosts status? Is it a good idea? IDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting idea. A background process could periodically check dead hosts to see if they've recovered. For now, I've implemented a cleanup mechanism that removes hosts after a configurable age (default 6h), which allows them to be retried naturally. The background checking approach could be a future enhancement.
) | ||
|
||
func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) { | ||
func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, globalDeadHostManager *deadhosts.Manager, client *warc.CustomHTTPClient) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We start to have a lot of arguments to this function, is there a cleaner way to do things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an ArchiverDependencies
struct to group the related parameters, reducing the function signature from 6 parameters to 4. This makes it cleaner and more maintainable. Commit d7ffd62.
// Check if host is marked as dead | ||
if globalDeadHostManager != nil && globalDeadHostManager.IsDeadHost(req.URL.Host) { | ||
logger.Info("skipping request to dead host", "host", req.URL.Host) | ||
item.SetStatus(models.ItemFailed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really Failed though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be similar to retries failing, just accelerated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right - it should behave similarly to retries failing, just accelerated. Using ItemFailed
is consistent with how other network failures are handled in the system when retries are exhausted.
} | ||
|
||
func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, bucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) { | ||
func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, bucketManager *ratelimiter.BucketManager, deadHostManager *deadhosts.Manager, client *warc.CustomHTTPClient) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as for archiver.go ArchiveItem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the headless archiver function signature to match the general archiver improvements. Commit d7ffd62.
internal/pkg/archiver/worker.go
Outdated
logger *log.FieldedLogger | ||
globalArchiver *archiver | ||
globalBucketManager *ratelimiter.BucketManager | ||
globalDeadHostManager *deadhosts.Manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
globalDeadHostsManager not globalDeadHostManager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the variable naming to globalDeadHostsManager
throughout the codebase. Commit d7ffd62.
internal/pkg/archiver/worker.go
Outdated
} | ||
|
||
// Initialize dead hosts manager | ||
globalDeadHostManager = deadhosts.NewManager(ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
globalDeadHostsManager not globalDeadHostManager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the variable naming to globalDeadHostsManager
throughout the codebase. Commit d7ffd62.
internal/pkg/config/config.go
Outdated
RateLimitCleanupFrequency time.Duration `mapstructure:"rate-limit-cleanup-frequency"` | ||
|
||
// Dead hosts detection | ||
DeadHostsEnabled bool `mapstructure:"dead-hosts-enabled"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dead-hosts-detection seems lik a better name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the config field name to dead-hosts-detection
as suggested. Commit d7ffd62.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
// isDeadHostError determines if an error indicates a dead host | ||
func isDeadHostError(err error) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better approach than checking the strings from the error messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved the error detection approach using errors.As()
to check for specific error types like net.OpError
, net.DNSError
, and net.Error
before falling back to string matching. This is more robust and less brittle. Commit d7ffd62.
// Manager tracks hosts that consistently deny connections | ||
type Manager struct { | ||
mu sync.RWMutex | ||
deadHosts map[string]*hostInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use a sync.Map? https://pkg.go.dev/sync#Map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sync.RWMutex is actually better for this use case since we need complex operations like iterating through all entries for cleanup and age checks. sync.Map is optimized for simple key-value operations but doesn't support the kind of bulk operations we need for the cleanup mechanism. Added a comment explaining this choice.
// Check if host is marked as dead | ||
if globalDeadHostManager != nil && globalDeadHostManager.IsDeadHost(req.URL.Host) { | ||
logger.Info("skipping request to dead host", "host", req.URL.Host) | ||
item.SetStatus(models.ItemFailed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be similar to retries failing, just accelerated?
cmd/get.go
Outdated
|
||
// Dead hosts detection flags | ||
getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.") | ||
getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default should be higher, perhaps 15 or 30?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the default max-failures to 15 as suggested. This provides a better balance between avoiding false positives and detecting genuinely dead hosts. Commit d7ffd62.
cmd/get.go
Outdated
getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.") | ||
getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.") | ||
getCmd.PersistentFlags().Duration("dead-hosts-refresh-period", time.Duration(30*time.Minute), "How often to clean up the dead hosts cache.") | ||
getCmd.PersistentFlags().Duration("dead-hosts-max-age", time.Duration(24*time.Hour), "How long to keep a host marked as dead before giving it another chance.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 6 hours may be a better default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the default max-age to 6 hours as suggested. This allows for faster recovery while still providing significant time savings during crawls. Commit d7ffd62.
|
||
resp, err = client.Do(req) | ||
if err != nil { | ||
// Record failure for dead host detection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be clarified to sending the error to the dead host detection function, not necessarily recording it in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified the comment to "Send error to dead host detection for analysis" to better reflect that the error is analyzed before being recorded. Commit d7ffd62.
…ion parameters Co-authored-by: NGTmeaty <2244519+NGTmeaty@users.noreply.github.com>
This PR implements a dead hosts detection system that tracks hosts which consistently deny connections and automatically skips them to avoid wasted network timeouts during crawls.
Overview
The dead hosts component addresses a common crawling inefficiency where Zeno would repeatedly attempt to connect to hosts that are permanently unreachable (dead hosts). This results in unnecessary network timeouts and slower crawls. The implementation is inspired by similar functionality in
warcprox
and the WBM live web checker.Key Features
Automatic Detection: Tracks network-level failures that indicate dead hosts:
no such host
,server misbehaving
)Smart Skipping: Once a host is marked as dead (after configurable failures), all subsequent requests to that host are skipped immediately
Recovery Mechanism: Dead hosts cache is periodically cleaned up to allow hosts to recover
Disabled by Default: Following the suggestion in the issue comments, the feature is disabled by default to avoid potential side effects
Robust Error Detection: Uses proper Go error type checking with
errors.As()
fornet.OpError
,net.DNSError
, andnet.Error
types, falling back to string matching only when necessaryClean Architecture: Refactored function signatures using
ArchiverDependencies
struct to group related parameters and improve maintainabilityConfiguration Options
Implementation Details
deadhosts
package ininternal/pkg/archiver/deadhosts/
ArchiverDependencies
structsync.RWMutex
for efficient concurrent access (chosen oversync.Map
due to need for complex cleanup operations)Example Usage
The component integrates with Zeno's existing rate limiter and follows the same architectural patterns. When enabled, it will log when hosts are being skipped due to being marked as dead, helping operators understand crawl behavior.
Code Quality Improvements
Based on code review feedback, the implementation includes:
Fixes #361.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.