Finding URLs to follow

UrlFinder is a helper class that finds URLs in web element attributes.

Create an instance

Call the createDefault method to obtain an instance with the default configuration. By default, elements are located with By.tagName("a") and each element's href attribute is searched for URLs.

To create a customized instance, use the UrlFinderBuilder class.

Specify the pattern

Specifies the pattern that URLs found in the attribute values must match.
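
As a rough illustration of what such a pattern might look like, the sketch below uses a java.util.regex.Pattern to pick URLs out of an attribute value. It assumes the pattern is an ordinary Java regular expression, which this page does not state. The class name PatternSketch, the findMatches helper, and the example.com pattern are hypothetical and are not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSketch {

    // Hypothetical pattern: absolute HTTP(S) URLs on example.com only
    static final Pattern EXAMPLE_PATTERN =
            Pattern.compile("https?://(www\\.)?example\\.com[^\\s\"'<>]*");

    // Returns every substring of the attribute value that matches the pattern
    static List<String> findMatches(final String attributeValue) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = EXAMPLE_PATTERN.matcher(attributeValue);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The first value matches the pattern, the second does not
        System.out.println(findMatches("https://example.com/docs"));
        System.out.println(findMatches("https://other.org/page"));
    }
}
```

A narrow pattern like this one is a simple way to keep a crawl on a single site. A broader pattern would let the crawler follow links to other domains.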

Specify the locating mechanisms

Only elements matched by the provided locators are searched for URLs.

Use:

setLocatingMechanism: to specify a single locator
setLocatingMechanisms: to specify multiple locators

You can find more information on locating mechanisms in the Selenium documentation.

Specify the name of the HTML attribute

Specifies the name of the web element attribute to search for a URL.

Specify the validator function

The provided function should accept a URL as a String and return a Boolean indicating whether the URL is valid. Invalid URLs are discarded.

The default validator attempts to create a URI instance from the string and parse the domain name. If it succeeds, the URL is considered valid.

Recommendation: use the UrlValidator class from the Apache Commons Validator library.
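
The sketch below shows one way a validator along the lines of the default described above could be written using only java.net.URI. It is an approximation under stated assumptions, not the library's actual implementation. The class name ValidatorSketch and the field HOST_VALIDATOR are hypothetical, and the use of Function<String, Boolean> as the function type is an assumption based on the String-to-Boolean description above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.function.Function;

public class ValidatorSketch {

    // Approximates the described behavior: parse the string as a URI,
    // then require that a host (domain name) could be extracted from it
    static final Function<String, Boolean> HOST_VALIDATOR = url -> {
        try {
            String host = new URI(url).getHost();
            return host != null && !host.isEmpty();
        } catch (URISyntaxException e) {
            // The string is not a syntactically valid URI
            return false;
        }
    };

    public static void main(String[] args) {
        System.out.println(HOST_VALIDATOR.apply("https://example.com/page"));   // true
        System.out.println(HOST_VALIDATOR.apply("mailto:someone@example.com")); // false: no host
        System.out.println(HOST_VALIDATOR.apply("http://exa mple.com"));        // false: illegal character
    }
}
```

A check like this only confirms that the string parses and has a host. It does not restrict the scheme or verify that the domain exists, which is one reason a dedicated validator may be preferable.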

Find URLs

Use:

findAllInResponse: to find all the URLs that match the pattern
findFirstInResponse: to find the first URL that matches the pattern

Example

The following example finds all the valid URLs in the href attribute of every anchor web element in the response.

public class MyCrawler extends Crawler {

    private final UrlFinder urlFinder;

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);

        // A helper class that is intended to make it easier to find URLs on web pages
        urlFinder = UrlFinder.createDefault();
    }

    @Override
    protected void onResponseSuccess(final ResponseSuccessEvent event) {
        // Crawl every URL found on the page
        urlFinder.findAllInResponse(event.getCompleteCrawlResponse())
                .stream()
                .map(CrawlRequest::createDefault)
                .forEach(this::crawl);

        // ...
    }
}