Extract hyperlinks from wayback machine #501

yxzhu16 · 2020-10-05T16:57:30Z

Describe the bug
When extracting links from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin, a page captured by the Wayback Machine, only a few links show up and most of the useful links are missing.

To Reproduce
Steps to reproduce the behavior (e.g.):

1. Load a WARC crawled from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin
2. Extract links
3. Not all of the hyperlinks show up

Expected behavior
Around 100 hyperlinks should show up, and should at least include https://www.bbc.com/pidgin/world-54001522.

Screenshots

Environment information

AUT version: 0.80.1-SNAPSHOT
OS: MacOS 10.15.6
Java version: Java 11
Apache Spark version: 3.0.1
Apache Spark w/aut: --jars
Apache Spark command used to run AUT: run with jupyter notebook

ianmilligan1 · 2020-10-05T17:12:42Z

Thanks for the issue @yxzhu16.

I looked at the source of the page, and here's the HTML where the missing link comes from:

<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a>

Is it possible that our use of jsoup in ExtractLinks isn't picking up those rewritten links because they're non-traditional?

schmika · 2020-10-06T19:15:30Z

Hi,
I've recently come across the same issue, and I think it happens because the link uses a relative URL rather than an absolute one.
In the AUT Scala code, ExtractLinks takes up to three parameters:

* @param src the src link
* @param html the content from which links are to be extracted
* @param base an optional base URI

The base URI is required to resolve relative URLs using link.attr("abs:href"). So I think you have to specify a base URI to be able to extract all links.
At the moment, however, the Python UDF extract_links only expects 2 parameters, if I understand the code correctly. It may be necessary to adapt the Python UDF to include the base parameter.
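
To illustrate why the base URI matters, here is a minimal sketch in plain Python. It uses only the standard library, not AUT or jsoup, so it approximates the behaviour rather than reproducing it. The helper name `resolve` is hypothetical; the `href` and `src` values are taken from this thread.

```python
from urllib.parse import urljoin, urlparse

# The relative href from the archived page and the page's own URL (the src).
href = "/web/20200903005938/https://www.bbc.com/pidgin/world-54001522"
src = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"


def resolve(base: str, link: str) -> str:
    """Hypothetical helper: return an absolute URL, or "" if none can be formed.

    This mimics how jsoup's abs:href is documented to behave (empty string
    when the URL cannot be made absolute). It is an approximation built on
    urllib, not jsoup's actual code.
    """
    resolved = urljoin(base, link)
    # A URL without a scheme is still relative, so treat it as unresolved.
    return resolved if urlparse(resolved).scheme else ""


# With an empty base the relative link cannot be resolved and is dropped.
print(repr(resolve("", href)))

# With the page URL as the base, the link resolves to an absolute
# web.archive.org URL and is kept.
print(resolve(src, href))
```

If this matches what jsoup does, an empty base would explain why only the links that were already absolute show up.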

ianmilligan1 · 2020-10-09T18:15:53Z

Fantastic stuff, @yxzhu16 – thanks so much for the pull request (and for the info on this too @schmika – much appreciated).

Set baseUri to be `src` instead of `base` when extracting links, and deleted `base` parameter.

The issue occurred because relative links cannot be extracted by `link.attr("abs:href")` when baseUri is not set. As I look through the code, param `base` is never provided anywhere when `ExtractLinks` is called, so default value `""` is always used, and baseUri is never set. However, `baseUri` is required to be able to extract relative links.

* resolves #501
* update tests
* remove unnecessary test results comment

Co-authored-by: Kai Zhong <kaizhchn@hotmail.com>
Co-authored-by: nruest <ruestn@gmail.com>
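
The idea behind the fix, using the page's own `src` as the base URI, can be sketched with Python's standard library. This is an illustration only, not the AUT Scala implementation or the Python UDF. The names `LinkCollector` and `extract_links_sketch` are hypothetical, and the (src, target, anchor text) tuple layout is an assumption made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (src, absolute target, anchor text) for each <a href>."""

    def __init__(self, src: str):
        super().__init__()
        self.src = src          # page URL, also used as the base URI
        self.links = []
        self._href = None       # resolved target of the <a> being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the page's own URL.
                self._href = urljoin(self.src, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Keep only targets that ended up absolute.
            if urlparse(self._href).scheme:
                text = "".join(self._text).strip()
                self.links.append((self.src, self._href, text))
            self._href = None


def extract_links_sketch(src: str, html: str):
    """Hypothetical stand-in for a link extractor that uses src as the base."""
    parser = LinkCollector(src)
    parser.feed(html)
    parser.close()
    return parser.links


page = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"
snippet = (
    '<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" '
    'class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">'
    "Five Tyler Perry movies wey make serious money for di Hollywood "
    "newest billionaire</a>"
)

for link in extract_links_sketch(page, snippet):
    print(link)
```

Because the base is taken from `src`, callers do not have to pass a separate base argument, which is the same design choice the commit describes.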

yxzhu16 mentioned this issue Oct 8, 2020

Fix relative links extraction #504

Merged

ruebot added bug Scala labels Oct 9, 2020

ruebot closed this as completed in 8435fba Oct 9, 2020
