Implement dead hosts component to skip consistently failing hosts #499

Copilot · 2025-09-23T08:33:12Z

This PR implements a dead hosts detection system that tracks hosts which consistently deny connections and automatically skips them to avoid wasted network timeouts during crawls.

Overview

The dead hosts component addresses a common crawling inefficiency where Zeno would repeatedly attempt to connect to hosts that are permanently unreachable (dead hosts). This results in unnecessary network timeouts and slower crawls. The implementation is inspired by similar functionality in warcprox and the WBM live web checker.

Key Features

Automatic Detection: Tracks network-level failures that indicate dead hosts:
- Connection refused errors
- DNS lookup failures (no such host, server misbehaving)
- Network unreachable errors
- Timeout errors
- Host unreachable errors
Smart Skipping: Once a host is marked as dead (after configurable failures), all subsequent requests to that host are skipped immediately
Recovery Mechanism: Dead hosts cache is periodically cleaned up to allow hosts to recover
Disabled by Default: Following the suggestion in the issue comments, the feature is disabled by default to avoid potential side effects
Robust Error Detection: Uses proper Go error type checking with errors.As() for net.OpError, net.DNSError, and net.Error types, falling back to string matching only when necessary
Clean Architecture: Refactored function signatures using ArchiverDependencies struct to group related parameters and improve maintainability

Configuration Options

# Enable dead hosts detection
--dead-hosts-detection

# Number of connection failures before marking a host as dead (default: 15)
--dead-hosts-max-failures 10

# How often to clean up the cache (default: 30m)
--dead-hosts-refresh-period 1h

# How long to keep hosts marked as dead (default: 6h)
--dead-hosts-max-age 12h

Implementation Details

Architecture: New deadhosts package in internal/pkg/archiver/deadhosts/
Integration: Seamlessly integrated with existing archiver workflow using ArchiverDependencies struct
Thread-Safe: Uses sync.RWMutex for efficient concurrent access (chosen over sync.Map due to need for complex cleanup operations)
Resource Management: Proper cleanup and goroutine lifecycle management
Testing: Comprehensive unit tests and integration tests

Example Usage

# Basic crawl with dead hosts detection enabled
./Zeno get url https://example.com --dead-hosts-detection

# More aggressive settings for faster dead host detection
./Zeno get url https://example.com \
  --dead-hosts-detection \
  --dead-hosts-max-failures 5 \
  --dead-hosts-refresh-period 15m

The component integrates with Zeno's existing rate limiter and follows the same architectural patterns. When enabled, it will log when hosts are being skipped due to being marked as dead, helping operators understand crawl behavior.

Code Quality Improvements

Based on code review feedback, the implementation includes:

Improved Default Values: More conservative defaults (15 failures, 6h max-age) to balance efficiency with accuracy
Enhanced Error Detection: Type-safe error checking reduces false positives compared to string-only matching
Cleaner Function Signatures: Reduced parameter count and improved maintainability through dependency injection pattern
Better Documentation: Clear explanations of design choices and architectural decisions

Fixes #361.

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Co-authored-by: CorentinB <5089772+CorentinB@users.noreply.github.com>

codecov-commenter · 2025-09-23T08:59:35Z

Codecov Report

❌ Patch coverage is 94.87179% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.01%. Comparing base (1f0d58f) to head (56a127a).

Files with missing lines	Patch %	Lines
internal/pkg/archiver/general/archiver.go	61.53%	4 Missing and 1 partial ⚠️
internal/pkg/archiver/worker.go	81.25%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #499      +/-   ##
==========================================
+ Coverage   56.39%   57.01%   +0.61%     
==========================================
  Files         130      131       +1     
  Lines        8120     8272     +152     
==========================================
+ Hits         4579     4716     +137     
- Misses       3172     3184      +12     
- Partials      369      372       +3

Flag	Coverage Δ
e2etests	`40.39% <30.76%> (-0.26%)`	⬇️
unittests	`30.20% <76.92%> (+0.87%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

CorentinB · 2025-09-23T08:55:31Z

cmd/get.go

 	getCmd.PersistentFlags().Duration("rate-limit-cleanup-frequency", time.Duration(5*time.Minute), "How often to run cleanup of stale buckets that are not accessed in the duration.")
+
+	// Dead hosts detection flags
+	getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.")


dead-hosts-detection seems lik a better name

Updated the flag name to --dead-hosts-detection as suggested. Commit d7ffd62.

CorentinB · 2025-09-23T08:56:24Z

cmd/get.go

+	// Dead hosts detection flags
+	getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.")
+	getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.")
+	getCmd.PersistentFlags().Duration("dead-hosts-refresh-period", time.Duration(30*time.Minute), "How often to clean up the dead hosts cache.")


This could be almost irrelevant if we had like a background process dedicated to checking the dead hosts status? Is it a good idea? IDK.

That's an interesting idea. A background process could periodically check dead hosts to see if they've recovered. For now, I've implemented a cleanup mechanism that removes hosts after a configurable age (default 6h), which allows them to be retried naturally. The background checking approach could be a future enhancement.

CorentinB · 2025-09-23T08:57:42Z

internal/pkg/archiver/general/archiver.go

 )

-func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) {
+func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, globalDeadHostManager *deadhosts.Manager, client *warc.CustomHTTPClient) {


We start to have a lot of arguments to this function, is there a cleaner way to do things?

Created an ArchiverDependencies struct to group the related parameters, reducing the function signature from 6 parameters to 4. This makes it cleaner and more maintainable. Commit d7ffd62.

CorentinB · 2025-09-23T08:58:25Z

internal/pkg/archiver/general/archiver.go

+	// Check if host is marked as dead
+	if globalDeadHostManager != nil && globalDeadHostManager.IsDeadHost(req.URL.Host) {
+		logger.Info("skipping request to dead host", "host", req.URL.Host)
+		item.SetStatus(models.ItemFailed)


Is it really Failed though?

It should be similar to retries failing, just accelerated?

You're right - it should behave similarly to retries failing, just accelerated. Using ItemFailed is consistent with how other network failures are handled in the system when retries are exhausted.

CorentinB · 2025-09-23T08:59:03Z

internal/pkg/archiver/headless/archiver.go

 }

-func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, bucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) {
+func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, bucketManager *ratelimiter.BucketManager, deadHostManager *deadhosts.Manager, client *warc.CustomHTTPClient) {


Same comment as for archiver.go ArchiveItem.

Updated the headless archiver function signature to match the general archiver improvements. Commit d7ffd62.

CorentinB · 2025-09-23T08:59:29Z

internal/pkg/archiver/worker.go

-	logger              *log.FieldedLogger
+	globalArchiver        *archiver
+	globalBucketManager   *ratelimiter.BucketManager
+	globalDeadHostManager *deadhosts.Manager


globalDeadHostsManager not globalDeadHostManager

Fixed the variable naming to globalDeadHostsManager throughout the codebase. Commit d7ffd62.

CorentinB · 2025-09-23T08:59:47Z

internal/pkg/archiver/worker.go

 		}
+
+		// Initialize dead hosts manager
+		globalDeadHostManager = deadhosts.NewManager(ctx,


globalDeadHostsManager not globalDeadHostManager

Fixed the variable naming to globalDeadHostsManager throughout the codebase. Commit d7ffd62.

CorentinB · 2025-09-23T09:00:08Z

internal/pkg/config/config.go

 	RateLimitCleanupFrequency time.Duration `mapstructure:"rate-limit-cleanup-frequency"`

+	// Dead hosts detection
+	DeadHostsEnabled        bool          `mapstructure:"dead-hosts-enabled"`


dead-hosts-detection seems lik a better name

Updated the config field name to dead-hosts-detection as suggested. Commit d7ffd62.

NGTmeaty

@copilot

NGTmeaty · 2025-09-26T02:11:15Z

internal/pkg/archiver/deadhosts/deadhosts.go

+}
+
+// isDeadHostError determines if an error indicates a dead host
+func isDeadHostError(err error) bool {


Is there a better approach than checking the strings from the error messages?

Improved the error detection approach using errors.As() to check for specific error types like net.OpError, net.DNSError, and net.Error before falling back to string matching. This is more robust and less brittle. Commit d7ffd62.

NGTmeaty · 2025-09-26T02:11:19Z

internal/pkg/archiver/deadhosts/deadhosts.go

+// Manager tracks hosts that consistently deny connections
+type Manager struct {
+	mu            sync.RWMutex
+	deadHosts     map[string]*hostInfo


Why not use a sync.Map? https://pkg.go.dev/sync#Map

sync.RWMutex is actually better for this use case since we need complex operations like iterating through all entries for cleanup and age checks. sync.Map is optimized for simple key-value operations but doesn't support the kind of bulk operations we need for the cleanup mechanism. Added a comment explaining this choice.

NGTmeaty · 2025-09-26T02:14:15Z

internal/pkg/archiver/general/archiver.go

+	// Check if host is marked as dead
+	if globalDeadHostManager != nil && globalDeadHostManager.IsDeadHost(req.URL.Host) {
+		logger.Info("skipping request to dead host", "host", req.URL.Host)
+		item.SetStatus(models.ItemFailed)


It should be similar to retries failing, just accelerated?

NGTmeaty · 2025-09-26T02:14:59Z

cmd/get.go

+
+	// Dead hosts detection flags
+	getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.")
+	getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.")


Default should be higher, perhaps 15 or 30?

Updated the default max-failures to 15 as suggested. This provides a better balance between avoiding false positives and detecting genuinely dead hosts. Commit d7ffd62.

NGTmeaty · 2025-09-26T02:15:13Z

cmd/get.go

+	getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.")
+	getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.")
+	getCmd.PersistentFlags().Duration("dead-hosts-refresh-period", time.Duration(30*time.Minute), "How often to clean up the dead hosts cache.")
+	getCmd.PersistentFlags().Duration("dead-hosts-max-age", time.Duration(24*time.Hour), "How long to keep a host marked as dead before giving it another chance.")


I think 6 hours may be a better default.

Updated the default max-age to 6 hours as suggested. This allows for faster recovery while still providing significant time savings during crawls. Commit d7ffd62.

NGTmeaty · 2025-09-26T02:16:24Z

internal/pkg/archiver/general/archiver.go


 		resp, err = client.Do(req)
 		if err != nil {
+			// Record failure for dead host detection


This should be clarified to sending the error to the dead host detection function, not necessarily recording it in.

Clarified the comment to "Send error to dead host detection for analysis" to better reflect that the error is analyzed before being recorded. Commit d7ffd62.

…ion parameters Co-authored-by: NGTmeaty <2244519+NGTmeaty@users.noreply.github.com>

Initial plan

141a5a3

Copilot AI assigned Copilot and CorentinB Sep 23, 2025

Copilot started work on behalf of CorentinB September 23, 2025 08:33 View session

Implement dead hosts component with configuration and integration

56a127a

Co-authored-by: CorentinB <5089772+CorentinB@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Dead hosts component~~ Implement dead hosts component to skip consistently failing hosts Sep 23, 2025

Copilot AI requested a review from CorentinB September 23, 2025 08:51

Copilot finished work on behalf of CorentinB September 23, 2025 08:51

CorentinB requested changes Sep 23, 2025

View reviewed changes

NGTmeaty reviewed Sep 26, 2025

View reviewed changes

Copilot started work on behalf of NGTmeaty September 26, 2025 02:17 View session

Address feedback: fix naming, improve error detection, refactor funct…

d7ffd62

…ion parameters Co-authored-by: NGTmeaty <2244519+NGTmeaty@users.noreply.github.com>

Copilot AI requested a review from NGTmeaty September 26, 2025 02:28

Copilot finished work on behalf of NGTmeaty September 26, 2025 02:28

Implement dead hosts component to skip consistently failing hosts #499

Are you sure you want to change the base?

Implement dead hosts component to skip consistently failing hosts #499

Uh oh!

Conversation

Copilot AI commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Key Features

Configuration Options

Implementation Details

Example Usage

Code Quality Improvements

Uh oh!

codecov-commenter commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

NGTmeaty left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented Sep 23, 2025 •

edited

Loading

codecov-commenter commented Sep 23, 2025 •

edited

Loading