Extract hyperlinks from wayback machine #501
Comments
Thanks for the issue @yxzhu16. I looked at the source of the page, and here's the HTML where the missing link comes from: `<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a>` Is it possible our |
Hi,
The base URI is required to resolve relative URLs using |
Set `baseUri` to `src` instead of `base` when extracting links, and removed the `base` parameter.

The issue occurred because relative links cannot be resolved by `link.attr("abs:href")` when `baseUri` is not set. Looking through the code, the `base` parameter is never provided anywhere `ExtractLinks` is called, so the default value `""` is always used and `baseUri` is never set. However, `baseUri` is required to extract relative links.

* resolves #501
* update tests
* remove unnecessary test results comment

Co-authored-by: Kai Zhong <kaizhchn@hotmail.com>
Co-authored-by: nruest <ruestn@gmail.com>
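The effect of a missing base URI can be illustrated outside the toolkit. The sketch below is not the project's code and does not use the same HTML library; it is a minimal Python analogue, using only the standard library, of resolving an `href` against a base and dropping links that do not come out absolute. The names `LinkCollector`, `extract_links`, `SRC`, and `HTML` are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects <a href> values, resolved against an optional base URI."""

    def __init__(self, base_uri=""):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # With an empty base, urljoin returns the href unchanged.
        absolute = urljoin(self.base_uri, href)
        # Keep only links that resolved to an absolute URL (they have a scheme).
        if urlparse(absolute).scheme:
            self.links.append(absolute)


def extract_links(html, base_uri=""):
    """Hypothetical helper: return the absolute links found in html."""
    collector = LinkCollector(base_uri)
    collector.feed(html)
    return collector.links


# The page URL from the issue, used as the base URI (the role `src` plays in the fix).
SRC = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"
# A shortened version of the anchor quoted in the comments.
HTML = '<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522">Five Tyler Perry movies</a>'

without_base = extract_links(HTML)      # relative link cannot be resolved
with_base = extract_links(HTML, SRC)    # relative link resolves against SRC

print(without_base)
print(with_base)
```

Under these assumptions, the relative link is dropped when no base is supplied and is kept once the page's own URL is passed as the base, which mirrors the behaviour described in the commit message.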
Describe the bug
When extracting links from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin, a page captured by the Wayback Machine, only a few links show up and most of the useful links are missing.
To Reproduce
Steps to reproduce the behavior (e.g.):
Expected behavior
Around 100 hyperlinks should show up, and should at least include https://www.bbc.com/pidgin/world-54001522.
Screenshots
Environment information