Handle wget WARC-Target-URI formatting. #515

Merged
merged 1 commit into from
May 11, 2021

@ruebot ruebot commented May 11, 2021

GitHub issue(s): #514

What does this Pull Request do?

Handle wget WARC-Target-URI formatting.

How should this be tested?

The included test should cover it. If you want to test locally, though, the following Scala and Python examples should do it:

// Scala (spark-shell)
import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/home/nruest/Projects/au/aut/src/test/resources/warc/issue-514.warc", sc).all()
  .select($"crawl_date", $"url")
  .show(2, false)

# Python (pyspark)
from aut import *

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/aut/src/test/resources/warc/issue-514.warc") \
  .all() \
  .select("crawl_date", "url") \
  .show(2, False)

Both should produce the following:

+----------+-----------------------------+
|crawl_date|url                          |
+----------+-----------------------------+
|20210511  |http://www.archiveteam.org/  |
|20210511  |https://wiki.archiveteam.org/|
+----------+-----------------------------+

Interested parties

@javieraespinosa let me know whether this resolves your issue. If it does, we can merge and I'll cut a release.

- Resolves #514
- Regex replacement for getUrl
- Add test
- Add test fixture
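
For readers who want a feel for the "regex replacement" item above, here is a minimal Python sketch. It is not the code in this pull request. It assumes the wget formatting in question is a WARC-Target-URI value wrapped in angle brackets (for example `<http://www.archiveteam.org/>`), and `normalize_target_uri` is a hypothetical helper name used only for illustration.

```python
import re

# Assumption: wget writes the target URI wrapped in angle brackets.
# This pattern captures whatever sits between one leading "<" and one trailing ">".
_WRAPPED_URI = re.compile(r"^<(.*)>$")


def normalize_target_uri(raw: str) -> str:
    """Hypothetical helper: return the URI without enclosing angle brackets."""
    value = raw.strip()
    match = _WRAPPED_URI.match(value)
    # Values that are not wrapped are returned unchanged (apart from trimming).
    return match.group(1) if match else value


print(normalize_target_uri("<http://www.archiveteam.org/>"))   # http://www.archiveteam.org/
print(normalize_target_uri("https://wiki.archiveteam.org/"))   # https://wiki.archiveteam.org/
```

Anchoring the pattern at both ends means a URL that merely contains `<` or `>` somewhere in the middle is left alone, which seems the safer default for a field that is normally a bare URI.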
@ruebot ruebot requested a review from ianmilligan1 May 11, 2021 19:52

codecov bot commented May 11, 2021

Codecov Report

Merging #515 (2d6b2e0) into main (8104a65) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main     #515      +/-   ##
============================================
+ Coverage     88.83%   88.92%   +0.08%     
  Complexity       57       57              
============================================
  Files            43       43              
  Lines          1012     1020       +8     
  Branches         85       85              
============================================
+ Hits            899      907       +8     
  Misses           74       74              
  Partials         39       39              

@javieraespinosa

Thanks @ruebot! Works like a charm.

@ianmilligan1 ianmilligan1 merged commit 5cb0665 into main May 11, 2021
@ianmilligan1 ianmilligan1 deleted the issue-514 branch May 11, 2021 22:28

Successfully merging this pull request may close these issues.

WARC-Target-URI in Wget warc files is not parsed properly