Replace Java ARC/WARC record processing library #494

Closed
ruebot opened this issue Aug 10, 2020 · 1 comment

ruebot commented Aug 10, 2020

Is your feature request related to a problem? Please describe.

A number of issues have crept up over the years in how we process ARC and WARC records before handing them off to Spark, namely #317, #492, and #493.

Describe the solution you'd like

Write a new Scala library to handle ARC and WARC processing. This can be part of aut or a standalone library, or we can use or build upon @helgeho's sparkling.

Describe alternatives you've considered

Fixing and patching what we have now, and potentially using jwarc (#411).

Additional context

Implementing this as a Spark data source could also address #371 completely. From the Spark dev list, I believe this is an example of implementing Cassandra as a data source that we can potentially build off of.

lintool commented Sep 2, 2020

FWIW, Common Crawl seems to use the ClueWeb WARC readers: https://github.com/commoncrawl/example-warc-java/tree/master/src/main/java

These are also the ones used in Anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/ClueWeb09Collection.java

My impression is that these readers are much more impoverished in terms of features... but may be much faster?

ruebot added a commit that referenced this issue Sep 30, 2021
* Partially address #494
ruebot added a commit that referenced this issue May 18, 2022
* fix discardDate issue
* update tests for #494
* add test for #493
* add test for #532
* move issue specific tests to their own directory
* add copyright statement to SparklingArchiveRecord
* move webarchive-commons back to 1.1.9
* resolves #532
* resolves #494
* resolves #493
* resolves #492
* resolves #317
* resolves #260
* resolves #182
* resolves #76
* resolves #74
* resolves #73
* resolves #23
* resolves #18
ruebot closed this as completed in c8fa256 May 24, 2022