Finding URLs to follow

Péter Bencze edited this page Jun 18, 2019 · 1 revision

Find URLs to follow

UrlFinder is a helper class that finds URLs in web element attributes.

Create an instance

An instance with the default configuration can be obtained by using the createDefault method. By default, the By.tagName("a") locating mechanism is used and the element's href attribute is searched for URLs.

To create a customized instance, use the UrlFinderBuilder class.

Specifies the pattern to use for matching.
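As a rough illustration of what pattern matching over attribute values looks like, the sketch below uses plain java.util.regex and does not call the library. The class name UrlPatternSketch, the helper findMatches, and the pattern itself are hypothetical; the pattern is chosen for illustration and is not necessarily the one the library uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlPatternSketch {

    // Hypothetical pattern chosen for illustration; not necessarily the library's default
    static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");

    // Returns every substring of the attribute value that matches the pattern
    static List<String> findMatches(final String attributeValue) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(attributeValue);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findMatches("https://example.com/page")); // prints [https://example.com/page]
        System.out.println(findMatches("javascript:void(0)"));       // prints []
    }
}
```

A stricter pattern (for example, one restricted to a single host) narrows the results before any validation takes place.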

Specify the locating mechanisms

Only elements matched by the provided locators are considered when searching for URLs.

Use:

You can find more information on locating mechanisms in the Selenium documentation.

Specifies the name of the web element attribute to search for a URL.

The provided function should accept a URL as a String and return a Boolean indicating whether the URL is valid. Invalid URLs are discarded.
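A validator of this shape can be sketched with a java.util.function.Function from String to Boolean. The example below is library-independent: the class name CustomValidatorSketch, the HTTPS_ONLY rule, and the discardInvalid helper are hypothetical and only mimic the described behavior of dropping URLs the validator rejects.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CustomValidatorSketch {

    // Hypothetical validator: accepts only HTTPS URLs
    static final Function<String, Boolean> HTTPS_ONLY = url -> url.startsWith("https://");

    // Keeps the URLs the validator accepts and discards the rest
    static List<String> discardInvalid(final List<String> urls,
                                       final Function<String, Boolean> validator) {
        return urls.stream()
                .filter(validator::apply)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> urls = List.of("https://a.test", "http://b.test");
        System.out.println(discardInvalid(urls, HTTPS_ONLY)); // prints [https://a.test]
    }
}
```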

The default validator attempts to create a URI instance from the string and parse the domain name. If it succeeds, the URL is considered valid.
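The check described above can be approximated with java.net.URI alone. This is a minimal sketch under assumptions, not the library's implementation: the class name DefaultValidatorSketch and the method isValidUrl are hypothetical, and the sketch only checks that a host can be parsed, which is weaker than parsing a full domain name.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DefaultValidatorSketch {

    // Approximation: a URL counts as valid if it parses as a URI and has a host component
    static boolean isValidUrl(final String url) {
        try {
            // getHost() returns null when the URI has no parsable host
            return new URI(url).getHost() != null;
        } catch (URISyntaxException e) {
            // The string is not a syntactically valid URI
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUrl("https://example.com/page")); // prints true
        System.out.println(isValidUrl("not a url"));                // prints false
        System.out.println(isValidUrl("/relative/path"));           // prints false
    }
}
```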

(Recommendation: use the UrlValidator class from the Apache Commons Validator library.)

Find URLs

Use the findAllInResponse method and pass it the complete crawl response, as in the example below.

Example

The following example finds all the valid URLs in the href attribute of every anchor web element in the response.

public class MyCrawler extends Crawler {

    private final UrlFinder urlFinder;

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);

        // A helper class that is intended to make it easier to find URLs on web pages
        urlFinder = UrlFinder.createDefault();
    }

    @Override
    protected void onResponseSuccess(final ResponseSuccessEvent event) {
        // Crawl every URL found on the page
        urlFinder.findAllInResponse(event.getCompleteCrawlResponse())
                .stream()
                .map(CrawlRequest::createDefault)
                .forEach(this::crawl);

        // ...
    }
}