Output filtered data to WARC Format #147

greebie · 2017-12-08T19:48:37Z

Users may desire outputs in WARC format after filtering their RDD[ArchiveRecord].

greebie · 2018-01-08T16:00:33Z

@dportabella has an example implementation of this here: https://gist.github.com/dportabella/3caf261c218a4448a03a14dbc06fe730

The other alternative is the more detailed WARCWriter class from iipc:
https://github.com/iipc/webarchive-commons/blob/master/src/main/java/org/archive/io/warc/WARCWriter.java which has me confused, honestly.

This feature has potential to be dangerous, as there is no real way to test the total size of the request. Take for example this pseudocode:

var record = RecordLoader('filePath', sc).map(x => SaveToWARC(record))

which would save the entire WARC once for every ArchiveRecord in record. It would be a juggernaut that does not stop until the server runs out of disk space.
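To make the concern concrete, here is a minimal sketch of the safer shape: serialize each record exactly once, and stop when a byte budget is reached. It uses no Spark or iipc code. `SimpleRecord`, `serialize` and `writeCapped` are hypothetical names, not part of AUT, and `SimpleRecord` is only a stand-in for `ArchiveRecord`. The header layout follows the WARC/1.0 record structure (version line, named fields, blank line, payload, two CRLFs), and the date is assumed to already be in `YYYY-MM-DDThh:mm:ssZ` form.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8
import java.util.UUID

object WarcSketch {
  // Hypothetical stand-in for AUT's ArchiveRecord; only the fields needed here.
  final case class SimpleRecord(url: String, date: String, payload: Array[Byte])

  // Serialize ONE record as a WARC/1.0 "resource" record:
  // version line, header fields, blank line, payload, then two CRLFs.
  def serialize(r: SimpleRecord): Array[Byte] = {
    val header =
      "WARC/1.0\r\n" +
      "WARC-Type: resource\r\n" +
      s"WARC-Record-ID: <urn:uuid:${UUID.randomUUID()}>\r\n" +
      s"WARC-Date: ${r.date}\r\n" +
      s"WARC-Target-URI: ${r.url}\r\n" +
      s"Content-Length: ${r.payload.length}\r\n" +
      "\r\n"
    header.getBytes(UTF_8) ++ r.payload ++ "\r\n\r\n".getBytes(UTF_8)
  }

  // Write each record exactly once, stopping before the byte budget is exceeded.
  // Returns the bytes written and the number of records included.
  def writeCapped(records: Iterator[SimpleRecord], maxBytes: Long): (Array[Byte], Int) = {
    val out = new ByteArrayOutputStream()
    var count = 0
    var full = false
    while (!full && records.hasNext) {
      val bytes = serialize(records.next())
      if (out.size() + bytes.length > maxBytes) full = true
      else { out.write(bytes); count += 1 }
    }
    (out.toByteArray, count)
  }
}
```

In a Spark job the same idea would presumably run once per partition (for example inside `foreachPartition`), writing to a file rather than an in-memory buffer, so that each record is written once and the output size stays bounded.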

I have to admit to being a little lost on the finer details of producing and saving WARC files here, and it's Monday, so I am prone to laziness. Any advice, @ruebot, @lintool and @ianmilligan1?

ianmilligan1 · 2018-01-08T16:07:00Z

The example looks promising, @greebie! I'm not too hung up on the danger, as long as the feature is well documented. But maybe I'm naive. The others may think differently, but my gut is that taking a stab at using @dportabella's example and seeing if it can play with AUT is probably the most fruitful way forward?

We can also discuss tomorrow.

dportabella · 2018-01-08T16:10:27Z

It would be nice to also create the CDX index at the same time.

greebie · 2018-01-08T16:19:03Z

Producing the CDX would be a safe start for testing purposes, actually. Thanks @dportabella!
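As a rough illustration of building the index "at the same time", the sketch below derives each record's byte offset from the lengths of the records written before it. All names here (`CdxSketch`, `IndexEntry`, `index`) are hypothetical, and the line layout is a deliberate simplification, not the full CDX field set, which also carries fields such as a canonicalized URL key, MIME type, status code and digest. Offsets are for uncompressed output; a gzipped WARC would need the compressed offsets instead.

```scala
object CdxSketch {
  // Hypothetical index entry: where one serialized record sits in the output file.
  final case class IndexEntry(url: String, timestamp: String, offset: Long,
                              length: Int, filename: String) {
    // Space-separated line; a simplification, NOT the full CDX field set.
    def toLine: String = s"$url $timestamp $offset $length $filename"
  }

  // Walk (url, timestamp, serializedBytes) triples in write order and record
  // each record's starting offset as the running total of bytes before it.
  def index(records: Seq[(String, String, Array[Byte])], filename: String): Seq[IndexEntry] = {
    val lengths = records.map(_._3.length)
    val offsets = lengths.scanLeft(0L)(_ + _) // offsets(i) = bytes before record i
    records.zip(offsets).map { case ((url, ts, bytes), off) =>
      IndexEntry(url, ts, off, bytes.length, filename)
    }
  }
}
```

Because the index only needs URLs, timestamps and lengths, it can be produced and inspected without writing any large files, which is what makes it a low-risk first test.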

greebie · 2018-01-11T14:02:28Z

Backing away from this issue for now until we find someone with a better understanding of the iipc toolkit.

ianmilligan1 · 2019-07-19T11:06:44Z

I think our conversations have largely moved away from the idea of creating new WARC files, and are now really focused on derivative datasets. Given this move in the project, could we consider closing this?

dportabella · 2019-07-20T07:27:38Z

I still think that filtering WARC files is an important task that AUT can solve.

ianmilligan1 · 2019-07-22T13:24:10Z

Thanks @dportabella! My sense is that our team's time is too limited to make this a short or medium-term issue for us, but any chance you'd be interested in opening up a PR based on the example code that @greebie shared up above?

dportabella · 2019-07-22T13:34:11Z

I shared a gist on achieving this task (included in @greebie's comment above), and I am currently using this approach.
I don't know much about the details of the AUT library, and I don't have time to get into it, sorry :(

sepastian · 2020-07-20T07:16:13Z

Adding writing of WARC records to the current AUT is one way of solving this.

But I would rather go with @ruebot's suggestion of hooking into existing infrastructure and creating a Spark Data Source, see #371.

ianmilligan1 changed the title ~~Output filtered data to Warc Format~~ Output filtered data to WARC Format Dec 19, 2017

greebie self-assigned this Dec 19, 2017

ianmilligan1 added the enhancement label Jan 6, 2018

ianmilligan1 added the RA-Task label Jan 11, 2018

ianmilligan1 unassigned greebie Aug 3, 2018

ruebot mentioned this issue Nov 5, 2019

Convert RecordLoader.loadArchives to a Spark Data Source #371

Closed

ruebot removed the RA-Task label May 25, 2022
