Skip to content

Conversation

Copilot
Copy link
Contributor

@Copilot Copilot AI commented Sep 23, 2025

This PR implements a dead hosts detection system that tracks hosts which consistently deny connections and automatically skips them to avoid wasted network timeouts during crawls.

Overview

The dead hosts component addresses a common crawling inefficiency where Zeno would repeatedly attempt to connect to hosts that are permanently unreachable (dead hosts). This results in unnecessary network timeouts and slower crawls. The implementation is inspired by similar functionality in warcprox and the WBM live web checker.

Key Features

  • Automatic Detection: Tracks network-level failures that indicate dead hosts:

    • Connection refused errors
    • DNS lookup failures (no such host, server misbehaving)
    • Network unreachable errors
    • Timeout errors
    • Host unreachable errors
  • Smart Skipping: Once a host is marked as dead (after configurable failures), all subsequent requests to that host are skipped immediately

  • Recovery Mechanism: Dead hosts cache is periodically cleaned up to allow hosts to recover

  • Disabled by Default: Following the suggestion in the issue comments, the feature is disabled by default to avoid potential side effects

  • Robust Error Detection: Uses proper Go error type checking with errors.As() for net.OpError, net.DNSError, and net.Error types, falling back to string matching only when necessary

  • Clean Architecture: Refactored function signatures using ArchiverDependencies struct to group related parameters and improve maintainability

Configuration Options

# Enable dead hosts detection
--dead-hosts-detection

# Number of connection failures before marking a host as dead (default: 15)
--dead-hosts-max-failures 10

# How often to clean up the cache (default: 30m)
--dead-hosts-refresh-period 1h

# How long to keep hosts marked as dead (default: 6h)
--dead-hosts-max-age 12h

Implementation Details

  • Architecture: New deadhosts package in internal/pkg/archiver/deadhosts/
  • Integration: Seamlessly integrated with existing archiver workflow using ArchiverDependencies struct
  • Thread-Safe: Uses sync.RWMutex for efficient concurrent access (chosen over sync.Map due to need for complex cleanup operations)
  • Resource Management: Proper cleanup and goroutine lifecycle management
  • Testing: Comprehensive unit tests and integration tests

Example Usage

# Basic crawl with dead hosts detection enabled
./Zeno get url https://example.com --dead-hosts-detection

# More aggressive settings for faster dead host detection
./Zeno get url https://example.com \
  --dead-hosts-detection \
  --dead-hosts-max-failures 5 \
  --dead-hosts-refresh-period 15m

The component integrates with Zeno's existing rate limiter and follows the same architectural patterns. When enabled, it will log when hosts are being skipped due to being marked as dead, helping operators understand crawl behavior.

Code Quality Improvements

Based on code review feedback, the implementation includes:

  • Improved Default Values: More conservative defaults (15 failures, 6h max-age) to balance efficiency with accuracy
  • Enhanced Error Detection: Type-safe error checking reduces false positives compared to string-only matching
  • Cleaner Function Signatures: Reduced parameter count and improved maintainability through dependency injection pattern
  • Better Documentation: Clear explanations of design choices and architectural decisions

Fixes #361.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Co-authored-by: CorentinB <5089772+CorentinB@users.noreply.github.com>
@Copilot Copilot AI changed the title [WIP] Dead hosts component Implement dead hosts component to skip consistently failing hosts Sep 23, 2025
@Copilot Copilot AI requested a review from CorentinB September 23, 2025 08:51
Copilot finished work on behalf of CorentinB September 23, 2025 08:51
@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 94.87179% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.01%. Comparing base (1f0d58f) to head (56a127a).

Files with missing lines Patch % Lines
internal/pkg/archiver/general/archiver.go 61.53% 4 Missing and 1 partial ⚠️
internal/pkg/archiver/worker.go 81.25% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #499      +/-   ##
==========================================
+ Coverage   56.39%   57.01%   +0.61%     
==========================================
  Files         130      131       +1     
  Lines        8120     8272     +152     
==========================================
+ Hits         4579     4716     +137     
- Misses       3172     3184      +12     
- Partials      369      372       +3     
Flag Coverage Δ
e2etests 40.39% <30.76%> (-0.26%) ⬇️
unittests 30.20% <76.92%> (+0.87%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

cmd/get.go Outdated
getCmd.PersistentFlags().Duration("rate-limit-cleanup-frequency", time.Duration(5*time.Minute), "How often to run cleanup of stale buckets that are not accessed in the duration.")

// Dead hosts detection flags
getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead-hosts-detection seems lik a better name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the flag name to --dead-hosts-detection as suggested. Commit d7ffd62.

// Dead hosts detection flags
getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.")
getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.")
getCmd.PersistentFlags().Duration("dead-hosts-refresh-period", time.Duration(30*time.Minute), "How often to clean up the dead hosts cache.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be almost irrelevant if we had like a background process dedicated to checking the dead hosts status? Is it a good idea? IDK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea. A background process could periodically check dead hosts to see if they've recovered. For now, I've implemented a cleanup mechanism that removes hosts after a configurable age (default 6h), which allows them to be retried naturally. The background checking approach could be a future enhancement.

)

func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) {
func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, globalDeadHostManager *deadhosts.Manager, client *warc.CustomHTTPClient) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We start to have a lot of arguments to this function, is there a cleaner way to do things?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an ArchiverDependencies struct to group the related parameters, reducing the function signature from 6 parameters to 4. This makes it cleaner and more maintainable. Commit d7ffd62.

// Check if host is marked as dead
if globalDeadHostManager != nil && globalDeadHostManager.IsDeadHost(req.URL.Host) {
logger.Info("skipping request to dead host", "host", req.URL.Host)
item.SetStatus(models.ItemFailed)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really Failed though?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be similar to retries failing, just accelerated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right - it should behave similarly to retries failing, just accelerated. Using ItemFailed is consistent with how other network failures are handled in the system when retries are exhausted.

}

func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, bucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) {
func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, bucketManager *ratelimiter.BucketManager, deadHostManager *deadhosts.Manager, client *warc.CustomHTTPClient) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for archiver.go ArchiveItem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the headless archiver function signature to match the general archiver improvements. Commit d7ffd62.

logger *log.FieldedLogger
globalArchiver *archiver
globalBucketManager *ratelimiter.BucketManager
globalDeadHostManager *deadhosts.Manager
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

globalDeadHostsManager not globalDeadHostManager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the variable naming to globalDeadHostsManager throughout the codebase. Commit d7ffd62.

}

// Initialize dead hosts manager
globalDeadHostManager = deadhosts.NewManager(ctx,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

globalDeadHostsManager not globalDeadHostManager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the variable naming to globalDeadHostsManager throughout the codebase. Commit d7ffd62.

RateLimitCleanupFrequency time.Duration `mapstructure:"rate-limit-cleanup-frequency"`

// Dead hosts detection
DeadHostsEnabled bool `mapstructure:"dead-hosts-enabled"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead-hosts-detection seems lik a better name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the config field name to dead-hosts-detection as suggested. Commit d7ffd62.

Copy link
Collaborator

@NGTmeaty NGTmeaty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

// isDeadHostError determines if an error indicates a dead host
func isDeadHostError(err error) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better approach than checking the strings from the error messages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved the error detection approach using errors.As() to check for specific error types like net.OpError, net.DNSError, and net.Error before falling back to string matching. This is more robust and less brittle. Commit d7ffd62.

// Manager tracks hosts that consistently deny connections
type Manager struct {
mu sync.RWMutex
deadHosts map[string]*hostInfo
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a sync.Map? https://pkg.go.dev/sync#Map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync.RWMutex is actually better for this use case since we need complex operations like iterating through all entries for cleanup and age checks. sync.Map is optimized for simple key-value operations but doesn't support the kind of bulk operations we need for the cleanup mechanism. Added a comment explaining this choice.

// Check if host is marked as dead
if globalDeadHostManager != nil && globalDeadHostManager.IsDeadHost(req.URL.Host) {
logger.Info("skipping request to dead host", "host", req.URL.Host)
item.SetStatus(models.ItemFailed)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be similar to retries failing, just accelerated?

cmd/get.go Outdated

// Dead hosts detection flags
getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.")
getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default should be higher, perhaps 15 or 30?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the default max-failures to 15 as suggested. This provides a better balance between avoiding false positives and detecting genuinely dead hosts. Commit d7ffd62.

cmd/get.go Outdated
getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.")
getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.")
getCmd.PersistentFlags().Duration("dead-hosts-refresh-period", time.Duration(30*time.Minute), "How often to clean up the dead hosts cache.")
getCmd.PersistentFlags().Duration("dead-hosts-max-age", time.Duration(24*time.Hour), "How long to keep a host marked as dead before giving it another chance.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 6 hours may be a better default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the default max-age to 6 hours as suggested. This allows for faster recovery while still providing significant time savings during crawls. Commit d7ffd62.


resp, err = client.Do(req)
if err != nil {
// Record failure for dead host detection
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be clarified to sending the error to the dead host detection function, not necessarily recording it in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified the comment to "Send error to dead host detection for analysis" to better reflect that the error is analyzed before being recorded. Commit d7ffd62.

…ion parameters

Co-authored-by: NGTmeaty <2244519+NGTmeaty@users.noreply.github.com>
@Copilot Copilot AI requested a review from NGTmeaty September 26, 2025 02:28
Copilot finished work on behalf of NGTmeaty September 26, 2025 02:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dead hosts component
4 participants