Finding URLs to follow
UrlFinder is a helper class that finds URLs in web element attributes.
An instance with the default configuration can be obtained by using the createDefault method. By default, the By.tagName("a") locating mechanism is used and the element's href attribute is searched for URLs.
To create a customized instance, use the UrlFinderBuilder class.
The pattern specifies which URLs to match.
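As an illustration of what such a pattern might look like, here is a minimal sketch using java.util.regex.Pattern on its own, without the library. The class name UrlPatternSketch, the example.com URLs, and the use of a full match (rather than a partial one) are assumptions made for this example; how UrlFinder applies the pattern internally is not stated here.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the library: shows how a regular
// expression can select only certain URLs from a list of candidates.
final class UrlPatternSketch {

    // Assumed example pattern: article pages on example.com with a numeric ID
    static final Pattern ARTICLE_URL =
            Pattern.compile("https?://example\\.com/articles/\\d+");

    // Keeps only the candidates that match the pattern in full
    static List<String> keepMatching(final List<String> candidates) {
        return candidates.stream()
                .filter(candidate -> ARTICLE_URL.matcher(candidate).matches())
                .collect(Collectors.toList());
    }
}
```

A narrower pattern keeps the crawl focused on the pages of interest, while a broad one (for example, any http or https URL) follows every link.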
Only elements matched by the provided locators will be considered when matching for URLs.
Use:
- setLocatingMechanism: to specify a single locator
- setLocatingMechanisms: to specify multiple locators
You can find more information on locating mechanisms in the Selenium documentation.
The attribute name specifies which web element attribute is searched for a URL.
The validator function should accept a URL as a String instance and return a Boolean indicating whether the given URL is valid. Invalid URLs are discarded.
The default validator attempts to create a URI instance from the string and parse the domain name. If it succeeds, the URL is considered valid.
(Recommendation: use the UrlValidator class of the Apache Commons Validator library.)
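To make the validator contract concrete, here is a minimal sketch of a function that follows the same idea as the default validator described above: build a URI from the string and check that a host can be parsed. It is not the library's actual implementation. The class name UrlValidatorSketch, the constant HAS_HOST, and the choice of Function<String, Boolean> as the function type are assumptions made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.function.Function;

// Hypothetical validator sketch, not taken from the library
final class UrlValidatorSketch {

    // Returns true only if the string parses as a URI with a host component
    static final Function<String, Boolean> HAS_HOST = url -> {
        if (url == null) {
            return false;
        }
        try {
            return new URI(url).getHost() != null;
        } catch (URISyntaxException e) {
            // Malformed input (for example, containing spaces) is invalid
            return false;
        }
    };

    public static void main(final String[] args) {
        System.out.println(HAS_HOST.apply("https://example.com/page")); // true
        System.out.println(HAS_HOST.apply("not a url"));                // false
    }
}
```

Relative paths and opaque URIs such as mailto: addresses have no host, so this sketch rejects them as well.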
To find URLs in a response, use:
- findAllInResponse: to find all the URLs that match the pattern
- findFirstInResponse: to find the first URL that matches the pattern
The following example finds all the valid URLs in the href attribute of every anchor web element in the response.
public class MyCrawler extends Crawler {

    private final UrlFinder urlFinder;

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);

        // A helper class that is intended to make it easier to find URLs on web pages
        urlFinder = UrlFinder.createDefault();
    }

    @Override
    protected void onResponseSuccess(final ResponseSuccessEvent event) {
        // Crawl every URL found on the page
        urlFinder.findAllInResponse(event.getCompleteCrawlResponse())
                .stream()
                .map(CrawlRequest::createDefault)
                .forEach(this::crawl);

        // ...
    }
}