Objective 2: Propose a better solution #2

imkh commented on Feb 13, 2025

Originally created on 19 August 2024 at 00:23 GMT+2


The goal of this PR is to propose a better architecture for the current program, one that is more robust and capable of handling a file with millions of URLs.

1. Refactor the program with the implementation of a worker pool

With a file containing millions of URLs, the previous code would create a goroutine for each one, without any regard for system resources. A common pattern for this kind of use case is a worker pool: two channels for communication (one for the jobs to work through, and another for the results) combined with a fixed number of workers that run concurrently and pick up URLs one by one until they're all done.

This caps the number of goroutines at the number of concurrent workers (plus two or three more for setup and for detecting when the work is done), which prevents making too many requests and using too much memory at once. Communication happens through channels to track the progress of the work, and a WaitGroup is once again used to wait for all the workers to finish.
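
As a rough illustration, a minimal worker-pool sketch could look like the following (the `checkURL` function and `result` type are hypothetical placeholders, not the PR's actual code):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// result pairs a URL with the outcome of checking it (hypothetical type).
type result struct {
	url    string
	status int
	err    error
}

// checkURL performs the per-URL work; placeholder for the real logic.
func checkURL(client *http.Client, url string) result {
	resp, err := client.Get(url)
	if err != nil {
		return result{url: url, err: err}
	}
	defer resp.Body.Close()
	return result{url: url, status: resp.StatusCode}
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second}
	jobs := make(chan string)
	results := make(chan result)

	// Fixed number of workers, regardless of how many URLs there are.
	const numWorkers = 10
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				results <- checkURL(client, url)
			}
		}()
	}

	// Close results once every worker has drained the jobs channel.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed the pool; the real program reads these from the input file.
	go func() {
		defer close(jobs)
		for _, u := range []string{"https://example.com", "https://example.org"} {
			jobs <- u
		}
	}()

	for r := range results {
		fmt.Println(r.url, r.status, r.err)
	}
}
```

Closing `results` from a dedicated goroutine after `wg.Wait()` lets the consumer use a plain `range` loop without any extra done-signaling.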

2. Remove the GetServices function and process URLs as soon as they're read

The file reading step is moved directly into the worker pool logic. As soon as a URL is read, it's sent on the jobs channel so that it can be processed immediately.
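
For example, the producer side might look like this hypothetical helper, which assumes the `jobs` channel from the sketch above and streams URLs into the pool without ever holding the whole file in memory:

```go
package pool

import (
	"bufio"
	"io"
)

// produceURLs sends each line of r into jobs as soon as the scanner
// reads it, then closes the channel to tell the workers that no more
// work is coming. Hypothetical helper, not the PR's actual code.
func produceURLs(r io.Reader, jobs chan<- string) error {
	defer close(jobs)
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		jobs <- scanner.Text()
	}
	return scanner.Err()
}
```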

3. Return results as soon as they're processed

Similarly, the results channel emits each result as soon as it's ready. This gives faster feedback to a user or to another program calling the new code.
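
Continuing the earlier sketch, the consuming side can simply range over `results` and report each outcome the moment a worker produces it:

```go
// Assumes the results channel and result type from the first sketch.
for r := range results {
	if r.err != nil {
		fmt.Printf("%s: error: %v\n", r.url, r.err)
		continue
	}
	fmt.Printf("%s: %d\n", r.url, r.status)
}
```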

4. Encapsulated WorkerPool struct to offer sensible defaults and configuration options

The WorkerPool struct contains the entire concurrency logic and publicly exposes a constructor that sets sensible defaults and allows for the customization of the following options:

  • Custom HTTP client, to set a different timeout or other HTTP options (the default client uses a 30-second timeout)
  • Number of concurrent workers (default is 10, configurable between 1 and 100). The main function also exposes a -workers command-line flag, whose value is passed to the WorkerPool constructor (see the sketch after this list).
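
A minimal sketch of what such a constructor could look like, using the functional-options pattern (the option names here are illustrative, not the PR's actual API):

```go
package pool

import (
	"net/http"
	"time"
)

// WorkerPool encapsulates the concurrency logic; field names are illustrative.
type WorkerPool struct {
	client     *http.Client
	numWorkers int
}

type Option func(*WorkerPool)

// WithHTTPClient overrides the default HTTP client.
func WithHTTPClient(c *http.Client) Option {
	return func(p *WorkerPool) { p.client = c }
}

// WithWorkers sets the worker count, keeping it in the documented 1-100 range.
func WithWorkers(n int) Option {
	return func(p *WorkerPool) {
		if n >= 1 && n <= 100 {
			p.numWorkers = n
		}
	}
}

// NewWorkerPool applies the documented defaults, then any options.
func NewWorkerPool(opts ...Option) *WorkerPool {
	p := &WorkerPool{
		client:     &http.Client{Timeout: 30 * time.Second}, // default 30s timeout
		numWorkers: 10,                                      // default 10 workers
	}
	for _, opt := range opts {
		opt(p)
	}
	return p
}
```

The `-workers` flag can then be wired in with something like `flag.Int("workers", 10, "number of concurrent workers")` and its value passed to `WithWorkers`.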

Other potential improvements:

  1. Adding a context to handle cancellation events such as SIGINT (Ctrl+C) or SIGTERM.
  2. Error handling for scanner.Err(). I wasn't able to trigger the error in my testing, but I've run into issues with it in the past in other programs, particularly when reading from stdin. A sketch covering both points follows this list.
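
A rough sketch of both ideas, assuming Go 1.16+ for `signal.NotifyContext`; the surrounding structure is illustrative, not the PR's code:

```go
package main

import (
	"bufio"
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Cancel the context on SIGINT (Ctrl+C) or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	jobs := make(chan string)
	go func() {
		defer close(jobs)
		scanner := bufio.NewScanner(os.Stdin)
		for scanner.Scan() {
			select {
			case jobs <- scanner.Text():
			case <-ctx.Done():
				return // stop feeding work once a signal arrives
			}
		}
		// Surface read errors that scanner.Scan() reports only as false.
		if err := scanner.Err(); err != nil {
			log.Printf("reading input: %v", err)
		}
	}()

	for url := range jobs {
		_ = url // a real program would hand url to the worker pool here
	}
}
```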
