Extract hyperlinks from wayback machine #501

Closed
yxzhu16 opened this issue Oct 5, 2020 · 3 comments
Comments

@yxzhu16
Contributor

yxzhu16 commented Oct 5, 2020

Describe the bug
When extracting links from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin, a page captured by the Wayback Machine, only a few links show up and most of the useful ones are missing.

To Reproduce
Steps to reproduce the behavior (e.g.):

  1. Load a WARC crawled from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin
  2. Extract links
  3. Not all of the hyperlinks are showing up

Expected behavior
Around 100 hyperlinks should show up, including at least https://www.bbc.com/pidgin/world-54001522.

Screenshots
[Screenshot: Screen Shot 2020-10-05 at 12 54 56 PM]

Environment information

  • AUT version: 0.80.1-SNAPSHOT
  • OS: MacOS 10.15.6
  • Java version: Java 11
  • Apache Spark version: 3.0.1
  • Apache Spark w/aut: --jars
  • Apache Spark command used to run AUT: run with jupyter notebook
@ianmilligan1
Member

Thanks for the issue @yxzhu16.

I looked at the source of the page, and here's the HTML where the missing link comes from:

<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a>

Is it possible our ExtractLinks use of jsoup isn't picking out those re-written links because they're non-traditional?

@schmika

schmika commented Oct 6, 2020

Hi,
I've recently come across the same issue, and I think it happens because the link uses a relative URL rather than an absolute one.
In the AUT Scala code, ExtractLinks accepts three parameters:

* @param src the src link
* @param html the content from which links are to be extracted
* @param base an optional base URI

The base URI is required to resolve relative URLs using link.attr("abs:href"). So I think you have to specify a base URI to be able to extract all links.
At the moment, however, the Python UDF extract_links only expects 2 parameters, if I understand the code correctly. It may be necessary to adapt the Python UDF to include the base parameter.
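
As a rough illustration of the behaviour described above, the sketch below models how an `abs:href` lookup treats the link from this issue. It uses Python's `urllib.parse` rather than jsoup, so `abs_href` is a hypothetical helper and not AUT or jsoup code. It assumes that a relative link with no base URI yields an empty string, which matches jsoup's documented behaviour for `abs:` attributes.

```python
from urllib.parse import urljoin, urlsplit


def abs_href(href: str, base_uri: str = "") -> str:
    """Approximate an abs:href lookup: an absolute URL, or "" if none can be built."""
    if urlsplit(href).scheme:
        # Already absolute, so no base URI is needed.
        return href
    if not base_uri:
        # Relative link with nothing to resolve it against.
        return ""
    return urljoin(base_uri, href)


# The archived page and the rewritten link quoted earlier in this thread.
page = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"
href = "/web/20200903005938/https://www.bbc.com/pidgin/world-54001522"

without_base = abs_href(href)       # base URI left at its default ""
with_base = abs_href(href, page)    # page URL supplied as the base URI

print(repr(without_base))
print(with_base)
```

With the default empty base the link resolves to nothing and would be dropped, which is consistent with the missing links reported here. Note that the resolved URL points at web.archive.org, not directly at www.bbc.com.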

@ianmilligan1
Member

Fantastic stuff, @yxzhu16 – thanks so much for the pull request (and for the info on this too @schmika – much appreciated).

ruebot added a commit that referenced this issue Jan 18, 2021
Set baseUri to `src` instead of `base` when extracting links, and delete the `base` parameter.

The issue occurred because relative links cannot be extracted by `link.attr("abs:href")` when baseUri is not set.
Looking through the code, the `base` parameter is never provided anywhere `ExtractLinks` is called, so the default value `""` is always used and baseUri is never set. However, `baseUri` is required to extract relative links.

* resolves #501 
* update tests
* remove unnecessary test results comment

Co-authored-by: Kai Zhong <kaizhchn@hotmail.com>
Co-authored-by: nruest <ruestn@gmail.com>
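
The idea behind the fix in this commit, resolving every link against the page's own `src` URL, can be sketched in Python using only the standard library. This is not the AUT implementation (which is Scala and uses jsoup); `extract_links_sketch` and `_LinkCollector` are hypothetical names, and the `(src, target, anchor text)` tuple shape is an assumption made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects (resolved href, anchor text) pairs from <a href=...> elements."""

    def __init__(self, base_uri: str) -> None:
        super().__init__()
        self.base_uri = base_uri
        self.links = []
        self._href = None   # resolved href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the base URI.
                self._href = urljoin(self.base_uri, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links_sketch(src: str, html: str):
    """Hypothetical helper: use src itself as the base URI for every link."""
    collector = _LinkCollector(base_uri=src)
    collector.feed(html)
    collector.close()
    return [(src, target, text) for target, text in collector.links]


src = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"
html = (
    '<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522">'
    "Five Tyler Perry movies</a>"
)
links = extract_links_sketch(src, html)
print(links)
```

Because the base URI is always the page URL, callers no longer need to pass a separate `base` argument, which is the simplification the commit message describes.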