Objective 2: Propose a better solution #2
Originally created on 19 August 2024 at 00:23 GMT+2
The goal of this PR is to propose a better architecture for the current program: one that is more robust and capable of handling a file with millions of URLs.
1. Refactor the program to use a worker pool
With a file containing millions of URLs, the previous code would create a goroutine for each one, without any regard for system resources. A common pattern for this kind of use case is a worker pool: two channels for communication (one for the jobs to work through, and another for the results) combined with a fixed number of workers that run concurrently and pick up the URLs to process one by one, until they're all done.
This limits the number of goroutines to at most the number of concurrent workers (plus two or three more, for setup and for detecting when everything is done), which prevents making too many requests and using too much memory at once. Communication is done through channels to track the progress of the work, and a WaitGroup is once again used to wait for all the workers to finish.
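As a rough sketch of the pattern (the `checkURL` function and the `result` type here are placeholders for whatever per-URL work the program actually does, not the real code):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// result is a placeholder for whatever the program reports per URL.
type result struct {
	url    string
	status int
	err    error
}

// checkURL stands in for the actual per-URL work.
func checkURL(url string) result {
	resp, err := http.Get(url)
	if err != nil {
		return result{url: url, err: err}
	}
	defer resp.Body.Close()
	return result{url: url, status: resp.StatusCode}
}

func main() {
	jobs := make(chan string)
	results := make(chan result)

	const numWorkers = 10 // fixed number of concurrent workers

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker picks URLs off the jobs channel one by one.
			for url := range jobs {
				results <- checkURL(url)
			}
		}()
	}

	// Feed the jobs channel, then close it so the workers can exit.
	go func() {
		for _, u := range []string{"https://example.com", "https://example.org"} {
			jobs <- u
		}
		close(jobs)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	for r := range results {
		fmt.Println(r.url, r.status, r.err)
	}
}
```

No matter how many URLs are fed into the jobs channel, only `numWorkers` goroutines ever perform requests at the same time.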
2. Remove the GetServices function and process URLs as soon as they're read
The file-reading step is moved directly into the worker pool logic. As soon as a URL is read, it's sent to the jobs channel so that it can be processed immediately.
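A minimal sketch of that idea, with a placeholder file name (the real program's identifiers may differ):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("urls.txt") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	jobs := make(chan string)

	// Read the file line by line and hand each URL to the workers
	// as soon as it is read, without building a slice of all URLs first.
	go func() {
		defer close(jobs)
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			jobs <- scanner.Text()
		}
	}()

	// In the real program the workers range over jobs; here we just drain it.
	for url := range jobs {
		fmt.Println("would process", url)
	}
}
```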
3. Return results as soon as they're processed
Similarly, the second channel outputs results as soon as they're ready. This gives faster feedback to a user or to another program calling the new code.
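The behaviour can be illustrated in isolation with simulated work (the timings and values below are made up): results are consumed the moment a worker produces them, because the results channel is only closed after every worker has finished.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	results := make(chan string)
	var wg sync.WaitGroup

	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			time.Sleep(time.Duration(i) * 100 * time.Millisecond) // simulate work
			results <- fmt.Sprintf("result %d", i)
		}(i)
	}

	// Close results only after all workers are done, so the range below ends.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Each result is printed as soon as it is produced, not after everything finishes.
	for r := range results {
		fmt.Println(r)
	}
}
```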
4. Encapsulated WorkerPool struct to offer sensible defaults and configuration options
The `WorkerPool` struct contains the entire concurrency logic and publicly exposes a constructor that sets sensible defaults while allowing the main options to be customized. These options are also exposed as command-line flags (such as `-workers`), with their values passed to the `WorkerPool` constructor.
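A sketch of what such a struct could look like; the field names, constructor signature, default value, and option set are assumptions rather than the exact API in this PR:

```go
package main

import (
	"flag"
	"fmt"
	"sync"
)

// WorkerPool encapsulates the concurrency logic behind a small public API.
type WorkerPool struct {
	workers int
}

// NewWorkerPool applies a sensible default and lets callers override it.
func NewWorkerPool(workers int) *WorkerPool {
	if workers <= 0 {
		workers = 10 // assumed default
	}
	return &WorkerPool{workers: workers}
}

// Run processes every job from the channel with a fixed number of workers
// and returns a channel on which results appear as they are produced.
func (p *WorkerPool) Run(jobs <-chan string, process func(string) string) <-chan string {
	results := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < p.workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- process(j)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()
	return results
}

func main() {
	// The -workers flag feeds straight into the constructor.
	workers := flag.Int("workers", 10, "number of concurrent workers")
	flag.Parse()

	pool := NewWorkerPool(*workers)

	jobs := make(chan string)
	go func() {
		defer close(jobs)
		for _, u := range []string{"https://example.com", "https://example.org"} {
			jobs <- u
		}
	}()

	for r := range pool.Run(jobs, func(u string) string { return "checked " + u }) {
		fmt.Println(r)
	}
}
```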
Other potential improvements:

- Check `scanner.Err()` (see the sketch below). I wasn't able to trigger the error in my testing, but I've run into issues with it in the past with other programs, particularly when reading from stdin.
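For illustration, a minimal version of that check when reading from stdin could look like this:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	// bufio.Scanner swallows read errors until Err() is consulted; without this
	// check, an I/O failure or an overly long line would be silently ignored.
	if err := scanner.Err(); err != nil {
		log.Fatalf("reading input: %v", err)
	}
}
```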