AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271

Closed
ruebot opened this issue Sep 18, 2018 · 23 comments

Comments

@ruebot
Member

ruebot commented Sep 18, 2018

Describe the bug
Came across this when processing a user's collection on cloud.archivesunleashed.org, using aut-0.16.0. The collection appears to have a couple of problematic WARCs, which throw this error:

2018-09-17 22:20:13,106 [Executor task launch worker for task 17270] INFO  NewHadoopRDD - Input split: file:/data/139/499/warcs/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-09-17 22:20:13,840 [Executor task launch worker for task 17269] ERROR Executor - Exception in task 17269.0 in stage 0.0 (TID 17269)
java.util.zip.ZipException: too many length or distance symbols
        at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
        at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
        at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
        at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
        at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
        at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
        at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
        at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
        at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
        at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
        at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
        at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
        at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
        at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
        at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:100)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Fuller log output available here.

To Reproduce

      import io.archivesunleashed._
      import io.archivesunleashed.app._
      import io.archivesunleashed.matchbox._
      sc.setLogLevel("INFO")
      RecordLoader.loadArchives("/data/139/499/warcs/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/data/139/499/60/derivatives/all-domains/output")
      RecordLoader.loadArchives("/data/139/499/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/data/139/499/60/derivatives/all-text/output")
      val links = RecordLoader.loadArchives("/data/139/499/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
      WriteGraphML(links, "/data/139/499/60/derivatives/gephi/499-gephi.graphml")
      sys.exit

The error occurs at this specific point:

      RecordLoader.loadArchives("/data/139/499/warcs/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/data/139/499/60/derivatives/all-domains/output")
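
Not from the original report, but since the glob hides which archive actually fails, here is a minimal sketch of how one might narrow it down by loading each file separately in the same spark-shell session. The loop, variable names, and OK/FAIL output are made up for illustration only:

      // Sketch only: run the failing step one archive at a time to find which
      // .gz file triggers the ZipException. Assumes the same spark-shell
      // session with aut on the classpath; paths are illustrative.
      import io.archivesunleashed._
      import io.archivesunleashed.matchbox._

      val warcDir = new java.io.File("/data/139/499/warcs")
      val archives = warcDir.listFiles
        .filter(_.getName.endsWith(".gz"))
        .map(_.getPath)
        .sorted

      archives.foreach { path =>
        try {
          // Same transformation as the failing job, restricted to one file.
          val pages = RecordLoader.loadArchives(path, sc)
            .keepValidPages()
            .map(r => ExtractDomain(r.getUrl))
            .count()
          println(s"OK   $path ($pages valid pages)")
        } catch {
          case e: Exception =>
            println(s"FAIL $path: ${e.getMessage}")
        }
      }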

Environment information

  • AUT version: 0.16.0
  • OS: Ubuntu 16.04.5 LTS (GNU/Linux 4.4.0-134-generic x86_64)
  • Java version: OpenJDK (openjdk version "1.8.0_181")
  • Apache Spark version: 2.1.3
  • Apache Spark w/aut: --packages
  • Apache Spark command used to run AUT: /home/ubuntu/aut/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[12] --driver-memory 105G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --packages \"io.archivesunleashed:aut:0.16.0\" -i /data/139/499/60/spark_jobs/499.scala | tee /data/139/499/60/spark_jobs/499.scala.log

Additional context

@ruebot
Member Author

ruebot commented Sep 18, 2018

Same thing on text extraction now, at the same point (same file):

RecordLoader.loadArchives("/data/139/499/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/data/139/499/60/derivatives/all-text/output")
2018-09-18 22:45:30,236 [Executor task launch worker for task 34541] INFO  NewHadoopRDD - Input split: file:/data/139/499/warcs/ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz:0+130144728
2018-09-18 22:45:30,305 [Executor task launch worker for task 34541] INFO  FileOutputCommitter - File Output Committer Algorithm version is 1
2018-09-18 22:45:31,051 [Executor task launch worker for task 34541] ERROR Utils - Aborting task
java.util.zip.ZipException: too many length or distance symbols
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1210)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1356)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:100)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-09-18 22:45:31,053 [Executor task launch worker for task 34541] ERROR Executor - Exception in task 17269.0 in stage 2.0 (TID 34541)
java.util.zip.ZipException: too many length or distance symbols
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1210)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1356)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:100)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-09-18 22:45:31,061 [dispatcher-event-loop-8] INFO  TaskSetManager - Starting task 17270.0 in stage 2.0 (TID 34542, localhost, executor driver, partition 17270, PROCESS_LOCAL, 19621 bytes)
2018-09-18 22:45:31,061 [Executor task launch worker for task 34542] INFO  Executor - Running task 17270.0 in stage 2.0 (TID 34542)
2018-09-18 22:45:31,062 [task-result-getter-0] WARN  TaskSetManager - Lost task 17269.0 in stage 2.0 (TID 34541, localhost, executor driver): java.util.zip.ZipException: too many length or distance symbols
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1210)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1356)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:100)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2018-09-18 22:45:31,062 [task-result-getter-0] ERROR TaskSetManager - Task 17269 in stage 2.0 failed 1 times; aborting job

@ianmilligan1
Member

Thanks for the update @ruebot! Could you put the file on rho? I'd love to poke at it tomorrow afternoon.

@ruebot
Member Author

ruebot commented Sep 19, 2018

/mnt/vol1/data_sets/auk_datasets/aut-issue-271

@ianmilligan1
Member

Individually, the WARC is fine; I've been able to extract domains and plain text from it. Hmm.

@ianmilligan1
Member

I see in the second failure that it's actually a different file: file:/data/139/499/warcs/ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz

@borislin
Collaborator

@ruebot Could you help put the file on tuna? I'm looking into this issue.

@ianmilligan1
Member

@borislin I put ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz in /tuna1/scratch/i2milligan. I think you've got permissions to grab the file from there but let me know if you run into trouble.

@ruebot if you have the file handy could you move ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz to tuna as well?

That said, with ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz I was not able to individually reproduce the problem.

As noted above, we think it's similar to #246.

@ruebot
Member Author

ruebot commented Sep 27, 2018

I'm moving all the problem files over right now. It's taking some time, and I'll let y'all know when I'm done.

@ianmilligan1
Member

Thx @ruebot - and thanks too @borislin! Let us know if we can help in any way.

@ruebot
Member Author

ruebot commented Sep 27, 2018

/home/ruestn/499-issues

There are 54 files there. They are a mix of files that include:

@borislin
Collaborator

@ruebot @ianmilligan1

What are the known files that are causing this ZipException issue? Only ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz and ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz ?

@ruebot
Member Author

ruebot commented Sep 28, 2018

Sounds right to me. Those are the ones in the error log. There is a link to a gist above too if you want to double check.

@borislin
Collaborator

@ruebot @ianmilligan1

I also can't reproduce the error for ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz. Do you still see the exception on your end when you run aut on this file?

For ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz, I got a warning instead of an error. No exception actually escaped from https://github.com/archivesunleashed/aut/blob/master/src/main/java/io/archivesunleashed/data/ArchiveRecordInputFormat.java#L175 or https://github.com/archivesunleashed/aut/blob/master/src/main/java/io/archivesunleashed/data/ArchiveRecordInputFormat.java#L186; the logger just logged the exception as a warning and the program proceeded. Do you see the same behaviour on your end? My aut version is the current master branch.
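
For reference, here is a rough Scala sketch of the catch-and-warn behaviour described above. This is not the aut source (the actual code in ArchiveRecordInputFormat is Java); the class and method names below are invented purely for illustration:

      // Sketch only: the reader catches the ZipException raised while
      // advancing to the next ARC record, logs it as a warning, and lets the
      // job continue instead of failing the Spark task.
      import java.util.zip.ZipException
      import org.apache.log4j.Logger

      class SkippingReader[R](records: Iterator[R]) {
        private val log = Logger.getLogger(getClass)

        // Returns Some(record) on success, None when the underlying gzip
        // member is corrupt; callers treat None as "no more usable records
        // in this split" rather than an error.
        def nextOrSkip(): Option[R] =
          try {
            if (records.hasNext) Some(records.next()) else None
          } catch {
            case e: ZipException =>
              log.warn("Trying skip of failed record: " + e.getMessage)
              None
          }
      }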

The command I've used is:

b25lin@tuna:~/aut$ /home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[12] --driver-memory 105G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar" -i /home/b25lin/spark_jobs/499.scala > 499.log

2018-09-30 16:15:10,647 [Executor task launch worker for task 4] WARN  ARCReaderFactory$CompressedARCReader$1 - Trying skip of failed record cleanup of file:/home/ruestn/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz: {subject-uri=http://nris.mt.gov/nsdi/data/doqq/spc/tif/d48106/d4810666SW.tif, ip-address=161.7.9.212, origin=, length=41799444, absolute-offset=29096, creation-date=20081223114930, content-type=image/tiff, version=null}: too many length or distance symbols
java.util.zip.ZipException: too many length or distance symbols
        at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
        at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
        at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
        at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
        at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
        at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:262)
        at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:248)
        at org.archive.io.ArchiveRecord.close(ArchiveRecord.java:172)
        at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:175)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:501)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
        at io.archivesunleashed.data.ArchiveRecordInputFormat$ArchiveRecordReader.nextKeyValue(ArchiveRecordInputFormat.java:195)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:100)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

@borislin
Collaborator

@ruebot HEAD

@ruebot
Member Author

ruebot commented Sep 30, 2018

Can you do the same testing with HEAD, but use all the files in /home/ruestn/499-issues? That should be a good test of both this one and #246 on HEAD. If we get warnings on all of them, then things seem to have resolved themselves between the 0.16.0 release and now, and if that's the case, I'll work on cutting a new release. If things still die, then we need to address those.

@borislin
Collaborator

borislin commented Oct 1, 2018

@ruebot I've done the same testing on all the files in /home/ruestn/499-issues. Only ARCHIVEIT-499-BIMONTHLY-5528-20131008090852109-00757-wbgrp-crawl053.us.archive.org-6443.warc.gz produces an EOFException, because that file is empty and we currently do not catch and handle this exception. All other files produce only warnings.
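
A possible workaround sketch for the empty-file case (not something aut does today): skip zero-length .gz files before handing the paths to RecordLoader. This assumes the underlying Hadoop input path handling accepts a comma-separated list of files, the same way the globs used elsewhere in this thread are expanded; the directory and filter below are illustrative:

      // Sketch only: filter out empty archives so they never reach the
      // reader, since an empty .warc.gz currently surfaces as an uncaught
      // EOFException.
      import io.archivesunleashed._

      val dir = new java.io.File("/home/ruestn/499-issues")
      val nonEmpty = dir.listFiles
        .filter(f => f.getName.endsWith(".gz") && f.length > 0)
        .map(_.getPath)
        .mkString(",")   // assumes comma-separated paths are accepted here

      val records = RecordLoader.loadArchives(nonEmpty, sc).keepValidPages()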

@ruebot
Member Author

ruebot commented Oct 2, 2018

Thanks! I'll carve out some time today and try to replicate.

@ruebot
Member Author

ruebot commented Oct 2, 2018

@borislin can you gist up your output log?

This is what I just ran on my end with all the 499-issues ARCs/WARCs:

/home/nruest/bin/spark-2.3.1-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log

(I use zsh, so I have to escape those brackets.)

2018-10-02 09:42:43 WARN  Utils:66 - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-10-02 09:42:43 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-10-02 09:42:43 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1538487767122).
Spark session available as 'spark'.
Loading /home/nruest/Dropbox/499-issues/spark-jobs/499.scala...
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
2018-10-02 09:42:49 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:42:49 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:42:49 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 10.0.1.44:42415 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:42:49 INFO  SparkContext:54 - Created broadcast 0 from newAPIHadoopFile at package.scala:51
2018-10-02 09:42:49 INFO  FileInputFormat:283 - Total input paths to process : 54
2018-10-02 09:42:49 INFO  SparkContext:54 - Starting job: sortBy at package.scala:74
2018-10-02 09:42:49 INFO  DAGScheduler:54 - Registering RDD 5 (map at package.scala:72)
2018-10-02 09:42:49 INFO  DAGScheduler:54 - Got job 0 (sortBy at package.scala:74) with 54 output partitions
2018-10-02 09:42:49 INFO  DAGScheduler:54 - Final stage: ResultStage 1 (sortBy at package.scala:74)
2018-10-02 09:42:49 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 0)
2018-10-02 09:42:49 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStage 0)
2018-10-02 09:42:49 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at map at package.scala:72), which has no missing parents
2018-10-02 09:42:49 INFO  MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 4.6 KB, free 15.8 GB)
2018-10-02 09:42:49 INFO  MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.6 KB, free 15.8 GB)
2018-10-02 09:42:49 INFO  BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on 10.0.1.44:42415 (size: 2.6 KB, free: 15.8 GB)
2018-10-02 09:42:49 INFO  SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:42:49 INFO  DAGScheduler:54 - Submitting 54 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at map at package.scala:72) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:42:49 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 54 tasks
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7976 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 8004 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:42:49 INFO  TaskSetManager:54 - Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 3.0 in stage 0.0 (TID 3)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 1.0 in stage 0.0 (TID 1)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 2.0 in stage 0.0 (TID 2)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 5.0 in stage 0.0 (TID 5)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 4.0 in stage 0.0 (TID 4)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 6.0 in stage 0.0 (TID 6)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 7.0 in stage 0.0 (TID 7)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 8.0 in stage 0.0 (TID 8)
2018-10-02 09:42:49 INFO  Executor:54 - Running task 9.0 in stage 0.0 (TID 9)
2018-10-02 09:42:49 INFO  Executor:54 - Fetching spark://10.0.1.44:46245/jars/aut-0.16.1-SNAPSHOT-fatjar.jar with timestamp 1538487767105
2018-10-02 09:42:49 INFO  TransportClientFactory:267 - Successfully created connection to /10.0.1.44:46245 after 16 ms (0 ms spent in bootstraps)
2018-10-02 09:42:49 INFO  Utils:54 - Fetching spark://10.0.1.44:46245/jars/aut-0.16.1-SNAPSHOT-fatjar.jar to /tmp/spark-ea2246da-4062-4046-b819-70932ad9d664/userFiles-81e47cb1-d944-40d7-9021-8e94f7a71310/fetchFileTemp8578759083499317273.tmp
2018-10-02 09:42:51 INFO  Executor:54 - Adding file:/tmp/spark-ea2246da-4062-4046-b819-70932ad9d664/userFiles-81e47cb1-d944-40d7-9021-8e94f7a71310/aut-0.16.1-SNAPSHOT-fatjar.jar to class loader
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:42:51 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:42:58 ERROR Executor:91 - Exception in task 4.0 in stage 0.0 (TID 4)
java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:42:58 INFO  TaskSetManager:54 - Starting task 10.0 in stage 0.0 (TID 10, localhost, executor driver, partition 10, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:42:58 INFO  Executor:54 - Running task 10.0 in stage 0.0 (TID 10)
2018-10-02 09:42:58 WARN  TaskSetManager:66 - Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2018-10-02 09:42:58 ERROR TaskSetManager:70 - Task 4 in stage 0.0 failed 1 times; aborting job
2018-10-02 09:42:58 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:42:58 INFO  TaskSchedulerImpl:54 - Cancelling stage 0
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
2018-10-02 09:42:58 INFO  TaskSchedulerImpl:54 - Stage 0 was cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 9.0 in stage 0.0 (TID 9), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 5.0 in stage 0.0 (TID 5), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 6.0 in stage 0.0 (TID 6), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 10.0 in stage 0.0 (TID 10), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 7.0 in stage 0.0 (TID 7), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor is trying to kill task 8.0 in stage 0.0 (TID 8), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
2018-10-02 09:42:58 INFO  Executor:54 - Executor killed task 9.0 in stage 0.0 (TID 9), reason: Stage cancelled
2018-10-02 09:42:58 WARN  TaskSetManager:66 - Lost task 9.0 in stage 0.0 (TID 9, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 INFO  DAGScheduler:54 - ShuffleMapStage 0 (map at package.scala:72) failed in 9.132 s due to Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
2018-10-02 09:42:58 INFO  DAGScheduler:54 - Job 0 failed: sortBy at package.scala:74, took 9.294232 s
2018-10-02 09:42:58 INFO  Executor:54 - Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
2018-10-02 09:42:58 WARN  TaskSetManager:66 - Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 INFO  Executor:54 - Executor killed task 10.0 in stage 0.0 (TID 10), reason: Stage cancelled
2018-10-02 09:42:58 WARN  TaskSetManager:66 - Lost task 10.0 in stage 0.0 (TID 10, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 INFO  Executor:54 - Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
2018-10-02 09:42:58 WARN  TaskSetManager:66 - Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 INFO  Executor:54 - Executor killed task 5.0 in stage 0.0 (TID 5), reason: Stage cancelled
2018-10-02 09:42:58 WARN  TaskSetManager:66 - Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
  at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:306)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:168)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:148)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:622)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.sortBy(RDD.scala:620)
  at io.archivesunleashed.package$CountableRDD.countItems(package.scala:74)
  ... 78 elided
Caused by: java.util.zip.ZipException: invalid code lengths set
  at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
  at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
  at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
  at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
  at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
  at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
  at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
  at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
  at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
  at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:42:59 INFO  Executor:54 - Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
2018-10-02 09:42:59 WARN  TaskSetManager:66 - Lost task 2.0 in stage 0.0 (TID 2, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:59 INFO  MemoryStore:54 - Block broadcast_2 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:42:59 INFO  MemoryStore:54 - Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:42:59 INFO  BlockManagerInfo:54 - Added broadcast_2_piece0 in memory on 10.0.1.44:42415 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:42:59 INFO  SparkContext:54 - Created broadcast 2 from newAPIHadoopFile at package.scala:51
2018-10-02 09:42:59 INFO  deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2018-10-02 09:42:59 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO  FileInputFormat:283 - Total input paths to process : 54
2018-10-02 09:42:59 INFO  SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2018-10-02 09:42:59 INFO  DAGScheduler:54 - Got job 1 (runJob at SparkHadoopWriter.scala:78) with 54 output partitions
2018-10-02 09:42:59 INFO  DAGScheduler:54 - Final stage: ResultStage 2 (runJob at SparkHadoopWriter.scala:78)
2018-10-02 09:42:59 INFO  DAGScheduler:54 - Parents of final stage: List()
2018-10-02 09:42:59 INFO  DAGScheduler:54 - Missing parents: List()
2018-10-02 09:42:59 INFO  DAGScheduler:54 - Submitting ResultStage 2 (MapPartitionsRDD[15] at saveAsTextFile at <console>:34), which has no missing parents
2018-10-02 09:42:59 INFO  MemoryStore:54 - Block broadcast_3 stored as values in memory (estimated size 72.3 KB, free 15.8 GB)
2018-10-02 09:42:59 INFO  MemoryStore:54 - Block broadcast_3_piece0 stored as bytes in memory (estimated size 25.9 KB, free 15.8 GB)
2018-10-02 09:42:59 INFO  BlockManagerInfo:54 - Added broadcast_3_piece0 in memory on 10.0.1.44:42415 (size: 25.9 KB, free: 15.8 GB)
2018-10-02 09:42:59 INFO  SparkContext:54 - Created broadcast 3 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:42:59 INFO  DAGScheduler:54 - Submitting 54 missing tasks from ResultStage 2 (MapPartitionsRDD[15] at saveAsTextFile at <console>:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:42:59 INFO  TaskSchedulerImpl:54 - Adding task set 2.0 with 54 tasks
2018-10-02 09:42:59 INFO  TaskSetManager:54 - Starting task 0.0 in stage 2.0 (TID 11, localhost, executor driver, partition 0, PROCESS_LOCAL, 7989 bytes)
2018-10-02 09:42:59 INFO  TaskSetManager:54 - Starting task 1.0 in stage 2.0 (TID 12, localhost, executor driver, partition 1, PROCESS_LOCAL, 8020 bytes)
2018-10-02 09:42:59 INFO  TaskSetManager:54 - Starting task 2.0 in stage 2.0 (TID 13, localhost, executor driver, partition 2, PROCESS_LOCAL, 7987 bytes)
2018-10-02 09:42:59 INFO  TaskSetManager:54 - Starting task 3.0 in stage 2.0 (TID 14, localhost, executor driver, partition 3, PROCESS_LOCAL, 7991 bytes)
2018-10-02 09:42:59 INFO  TaskSetManager:54 - Starting task 4.0 in stage 2.0 (TID 15, localhost, executor driver, partition 4, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:42:59 INFO  TaskSetManager:54 - Starting task 5.0 in stage 2.0 (TID 16, localhost, executor driver, partition 5, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:42:59 INFO  TaskSetManager:54 - Starting task 6.0 in stage 2.0 (TID 17, localhost, executor driver, partition 6, PROCESS_LOCAL, 8015 bytes)
2018-10-02 09:42:59 INFO  Executor:54 - Running task 3.0 in stage 2.0 (TID 14)
2018-10-02 09:42:59 INFO  Executor:54 - Running task 6.0 in stage 2.0 (TID 17)
2018-10-02 09:42:59 INFO  Executor:54 - Running task 5.0 in stage 2.0 (TID 16)
2018-10-02 09:42:59 INFO  Executor:54 - Running task 2.0 in stage 2.0 (TID 13)
2018-10-02 09:42:59 INFO  Executor:54 - Running task 1.0 in stage 2.0 (TID 12)
2018-10-02 09:42:59 INFO  Executor:54 - Running task 4.0 in stage 2.0 (TID 15)
2018-10-02 09:42:59 INFO  Executor:54 - Running task 0.0 in stage 2.0 (TID 11)
2018-10-02 09:42:59 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:42:59 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:42:59 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:42:59 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:42:59 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:42:59 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:42:59 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:42:59 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 17
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 22
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 18
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 20
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 6
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 13
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 7
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 1
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 4
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 5
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned shuffle 0
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 2
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 0
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 12
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 14
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 8
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 19
2018-10-02 09:43:00 INFO  ContextCleaner:54 - Cleaned accumulator 24
2018-10-02 09:43:00 INFO  BlockManagerInfo:54 - Removed broadcast_0_piece0 on 10.0.1.44:42415 in memory (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:43:02 INFO  Executor:54 - Executor killed task 8.0 in stage 0.0 (TID 8), reason: Stage cancelled
2018-10-02 09:43:02 INFO  TaskSetManager:54 - Starting task 7.0 in stage 2.0 (TID 18, localhost, executor driver, partition 7, PROCESS_LOCAL, 7989 bytes)
2018-10-02 09:43:02 WARN  TaskSetManager:66 - Lost task 8.0 in stage 0.0 (TID 8, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:02 INFO  Executor:54 - Running task 7.0 in stage 2.0 (TID 18)
2018-10-02 09:43:02 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:43:02 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:04 INFO  Executor:54 - Executor killed task 6.0 in stage 0.0 (TID 6), reason: Stage cancelled
2018-10-02 09:43:04 INFO  TaskSetManager:54 - Starting task 8.0 in stage 2.0 (TID 19, localhost, executor driver, partition 8, PROCESS_LOCAL, 8020 bytes)
2018-10-02 09:43:04 WARN  TaskSetManager:66 - Lost task 6.0 in stage 0.0 (TID 6, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:04 INFO  Executor:54 - Running task 8.0 in stage 2.0 (TID 19)
2018-10-02 09:43:04 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:43:04 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:06 INFO  Executor:54 - Executor killed task 7.0 in stage 0.0 (TID 7), reason: Stage cancelled
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 9.0 in stage 2.0 (TID 20, localhost, executor driver, partition 9, PROCESS_LOCAL, 7991 bytes)
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 INFO  Executor:54 - Running task 9.0 in stage 2.0 (TID 20)
2018-10-02 09:43:06 INFO  ContextCleaner:54 - Cleaned accumulator 11
2018-10-02 09:43:06 INFO  ContextCleaner:54 - Cleaned accumulator 15
2018-10-02 09:43:06 INFO  ContextCleaner:54 - Cleaned accumulator 3
2018-10-02 09:43:06 INFO  ContextCleaner:54 - Cleaned accumulator 21
2018-10-02 09:43:06 INFO  ContextCleaner:54 - Cleaned accumulator 16
2018-10-02 09:43:06 INFO  ContextCleaner:54 - Cleaned accumulator 23
2018-10-02 09:43:06 INFO  ContextCleaner:54 - Cleaned accumulator 9
2018-10-02 09:43:06 INFO  ContextCleaner:54 - Cleaned accumulator 10
2018-10-02 09:43:06 INFO  TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool 
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:43:06 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000005_0 aborted.
2018-10-02 09:43:06 ERROR Executor:91 - Exception in task 5.0 in stage 2.0 (TID 16)
org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 10.0 in stage 2.0 (TID 21, localhost, executor driver, partition 10, PROCESS_LOCAL, 8018 bytes)
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more

2018-10-02 09:43:06 ERROR TaskSetManager:70 - Task 5 in stage 2.0 failed 1 times; aborting job
2018-10-02 09:43:06 INFO  TaskSchedulerImpl:54 - Cancelling stage 2
2018-10-02 09:43:06 INFO  Executor:54 - Running task 10.0 in stage 2.0 (TID 21)
2018-10-02 09:43:06 INFO  TaskSchedulerImpl:54 - Stage 2 was cancelled
2018-10-02 09:43:06 INFO  DAGScheduler:54 - ResultStage 2 (runJob at SparkHadoopWriter.scala:78) failed in 7.273 s due to Job aborted due to stage failure: Task 5 in stage 2.0 failed 1 times, most recent failure: Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more

Driver stacktrace:
2018-10-02 09:43:06 INFO  DAGScheduler:54 - Job 1 failed: runJob at SparkHadoopWriter.scala:78, took 7.276136 s
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 4.0 in stage 2.0 (TID 15), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 1.0 in stage 2.0 (TID 12), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 8.0 in stage 2.0 (TID 19), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 2.0 in stage 2.0 (TID 13), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 9.0 in stage 2.0 (TID 20), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 6.0 in stage 2.0 (TID 17), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 10.0 in stage 2.0 (TID 21), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 7.0 in stage 2.0 (TID 18), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 3.0 in stage 2.0 (TID 14), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor is trying to kill task 0.0 in stage 2.0 (TID 11), reason: Stage cancelled
2018-10-02 09:43:06 ERROR SparkHadoopWriter:91 - Aborting job job_20181002094259_0015.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 2.0 failed 1 times, most recent failure: Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:34)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:39)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:41)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:43)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:45)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:47)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:49)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:51)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:53)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:55)
	at $line21.$read$$iw$$iw$$iw$$iw.<init>(<console>:57)
	at $line21.$read$$iw$$iw$$iw.<init>(<console>:59)
	at $line21.$read$$iw$$iw.<init>(<console>:61)
	at $line21.$read$$iw.<init>(<console>:63)
	at $line21.$read.<init>(<console>:65)
	at $line21.$read$.<init>(<console>:69)
	at $line21.$read$.<clinit>(<console>)
	at $line21.$eval$.$print$lzycompute(<console>:7)
	at $line21.$eval$.$print(<console>:6)
	at $line21.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
	at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
	at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
	at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
	at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
	at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5$$anonfun$apply$6.apply(ILoop.scala:427)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5$$anonfun$apply$6.apply(ILoop.scala:423)
	at scala.reflect.io.Streamable$Chars$class.applyReader(Streamable.scala:111)
	at scala.reflect.io.File.applyReader(File.scala:50)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5.apply(ILoop.scala:423)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5.apply(ILoop.scala:423)
	at scala.tools.nsc.interpreter.ILoop.savingReplayStack(ILoop.scala:91)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1.apply(ILoop.scala:422)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1.apply(ILoop.scala:422)
	at scala.tools.nsc.interpreter.ILoop.savingReader(ILoop.scala:96)
	at scala.tools.nsc.interpreter.ILoop.interpretAllFrom(ILoop.scala:421)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$run$3$1.apply(ILoop.scala:577)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$run$3$1.apply(ILoop.scala:576)
	at scala.tools.nsc.interpreter.ILoop.withFile(ILoop.scala:570)
	at scala.tools.nsc.interpreter.ILoop.run$3(ILoop.scala:576)
	at scala.tools.nsc.interpreter.ILoop.loadCommand(ILoop.scala:583)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$standardCommands$8.apply(ILoop.scala:207)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$standardCommands$8.apply(ILoop.scala:207)
	at scala.tools.nsc.interpreter.LoopCommands$LineCmd.apply(LoopCommands.scala:62)
	at scala.tools.nsc.interpreter.ILoop.colonCommand(ILoop.scala:688)
	at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:679)
	at scala.tools.nsc.interpreter.ILoop.loadFiles(ILoop.scala:835)
	at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:111)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
	at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
	at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
	at org.apache.spark.repl.Main$.doMain(Main.scala:76)
	at org.apache.spark.repl.Main$.main(Main.scala:56)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000000_0 aborted.
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000007_0 aborted.
2018-10-02 09:43:06 INFO  Executor:54 - Executor interrupted and killed task 7.0 in stage 2.0 (TID 18), reason: Stage cancelled
2018-10-02 09:43:06 INFO  Executor:54 - Executor interrupted and killed task 0.0 in stage 2.0 (TID 11), reason: Stage cancelled
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 7.0 in stage 2.0 (TID 18, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 0.0 in stage 2.0 (TID 11, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 INFO  Executor:54 - Executor interrupted and killed task 10.0 in stage 2.0 (TID 21), reason: Stage cancelled
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 10.0 in stage 2.0 (TID 21, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000004_0
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000004_0 aborted.
2018-10-02 09:43:06 INFO  Executor:54 - Executor interrupted and killed task 4.0 in stage 2.0 (TID 15), reason: Stage cancelled
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000001_0
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000001_0 aborted.
2018-10-02 09:43:06 INFO  Executor:54 - Executor interrupted and killed task 1.0 in stage 2.0 (TID 12), reason: Stage cancelled
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 1.0 in stage 2.0 (TID 12, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:96)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
  ... 78 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 2.0 failed 1 times, most recent failure: Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
  at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
  ... 106 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
  at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
  at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
  at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
  at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
  at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
  at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
  at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
  at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
  at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
  at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
  at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
  at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
  at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
  at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
  ... 8 more
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000009_0
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000009_0 aborted.
2018-10-02 09:43:06 INFO  Executor:54 - Executor interrupted and killed task 9.0 in stage 2.0 (TID 20), reason: Stage cancelled
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 9.0 in stage 2.0 (TID 20, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000003_0
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000003_0 aborted.
2018-10-02 09:43:06 INFO  Executor:54 - Executor interrupted and killed task 3.0 in stage 2.0 (TID 14), reason: Stage cancelled
2018-10-02 09:43:06 WARN  TaskSetManager:66 - Lost task 3.0 in stage 2.0 (TID 14, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 INFO  MemoryStore:54 - Block broadcast_4 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:43:06 INFO  MemoryStore:54 - Block broadcast_4_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:43:06 INFO  BlockManagerInfo:54 - Added broadcast_4_piece0 in memory on 10.0.1.44:42415 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:43:06 INFO  SparkContext:54 - Created broadcast 4 from newAPIHadoopFile at package.scala:51
2018-10-02 09:43:06 INFO  FileInputFormat:283 - Total input paths to process : 54
2018-10-02 09:43:06 INFO  SparkContext:54 - Starting job: sortBy at package.scala:74
2018-10-02 09:43:06 INFO  DAGScheduler:54 - Registering RDD 23 (map at package.scala:72)
2018-10-02 09:43:06 INFO  DAGScheduler:54 - Got job 2 (sortBy at package.scala:74) with 54 output partitions
2018-10-02 09:43:06 INFO  DAGScheduler:54 - Final stage: ResultStage 4 (sortBy at package.scala:74)
2018-10-02 09:43:06 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 3)
2018-10-02 09:43:06 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStage 3)
2018-10-02 09:43:06 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 3 (MapPartitionsRDD[23] at map at package.scala:72), which has no missing parents
2018-10-02 09:43:06 INFO  MemoryStore:54 - Block broadcast_5 stored as values in memory (estimated size 4.7 KB, free 15.8 GB)
2018-10-02 09:43:06 INFO  MemoryStore:54 - Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.4 KB, free 15.8 GB)
2018-10-02 09:43:06 INFO  BlockManagerInfo:54 - Added broadcast_5_piece0 in memory on 10.0.1.44:42415 (size: 2.4 KB, free: 15.8 GB)
2018-10-02 09:43:06 INFO  SparkContext:54 - Created broadcast 5 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:43:06 INFO  DAGScheduler:54 - Submitting 54 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[23] at map at package.scala:72) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:43:06 INFO  TaskSchedulerImpl:54 - Adding task set 3.0 with 54 tasks
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 0.0 in stage 3.0 (TID 22, localhost, executor driver, partition 0, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 1.0 in stage 3.0 (TID 23, localhost, executor driver, partition 1, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 2.0 in stage 3.0 (TID 24, localhost, executor driver, partition 2, PROCESS_LOCAL, 7976 bytes)
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 3.0 in stage 3.0 (TID 25, localhost, executor driver, partition 3, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 4.0 in stage 3.0 (TID 26, localhost, executor driver, partition 4, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 5.0 in stage 3.0 (TID 27, localhost, executor driver, partition 5, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:43:06 INFO  TaskSetManager:54 - Starting task 6.0 in stage 3.0 (TID 28, localhost, executor driver, partition 6, PROCESS_LOCAL, 8004 bytes)
2018-10-02 09:43:06 INFO  Executor:54 - Running task 0.0 in stage 3.0 (TID 22)
2018-10-02 09:43:06 INFO  Executor:54 - Running task 5.0 in stage 3.0 (TID 27)
2018-10-02 09:43:06 INFO  Executor:54 - Running task 2.0 in stage 3.0 (TID 24)
2018-10-02 09:43:06 INFO  Executor:54 - Running task 1.0 in stage 3.0 (TID 23)
2018-10-02 09:43:06 INFO  Executor:54 - Running task 6.0 in stage 3.0 (TID 28)
2018-10-02 09:43:06 INFO  Executor:54 - Running task 3.0 in stage 3.0 (TID 25)
2018-10-02 09:43:06 INFO  Executor:54 - Running task 4.0 in stage 3.0 (TID 26)
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:43:06 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 47
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 31
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 28
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 26
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 33
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 44
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 34
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 36
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 35
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 43
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 37
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 30
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 49
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 27
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 48
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 32
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 39
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 38
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 29
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 42
2018-10-02 09:43:08 INFO  BlockManagerInfo:54 - Removed broadcast_1_piece0 on 10.0.1.44:42415 in memory (size: 2.6 KB, free: 15.8 GB)
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 40
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 46
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 25
2018-10-02 09:43:08 INFO  BlockManagerInfo:54 - Removed broadcast_2_piece0 on 10.0.1.44:42415 in memory (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 41
2018-10-02 09:43:08 INFO  ContextCleaner:54 - Cleaned accumulator 45
2018-10-02 09:43:11 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:11 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000006_0
2018-10-02 09:43:11 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000006_0 aborted.
2018-10-02 09:43:11 INFO  Executor:54 - Executor interrupted and killed task 6.0 in stage 2.0 (TID 17), reason: Stage cancelled
2018-10-02 09:43:11 INFO  TaskSetManager:54 - Starting task 7.0 in stage 3.0 (TID 29, localhost, executor driver, partition 7, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:43:11 WARN  TaskSetManager:66 - Lost task 6.0 in stage 2.0 (TID 17, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:11 INFO  Executor:54 - Running task 7.0 in stage 3.0 (TID 29)
2018-10-02 09:43:11 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:43:13 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:13 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000002_0
2018-10-02 09:43:13 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000002_0 aborted.
2018-10-02 09:43:13 INFO  Executor:54 - Executor interrupted and killed task 2.0 in stage 2.0 (TID 13), reason: Stage cancelled
2018-10-02 09:43:13 INFO  TaskSetManager:54 - Starting task 8.0 in stage 3.0 (TID 30, localhost, executor driver, partition 8, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:43:13 WARN  TaskSetManager:66 - Lost task 2.0 in stage 2.0 (TID 13, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:13 INFO  Executor:54 - Running task 8.0 in stage 3.0 (TID 30)
2018-10-02 09:43:13 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:43:14 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:14 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000008_0
2018-10-02 09:43:14 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000008_0 aborted.
2018-10-02 09:43:14 INFO  Executor:54 - Executor interrupted and killed task 8.0 in stage 2.0 (TID 19), reason: Stage cancelled
2018-10-02 09:43:14 INFO  TaskSetManager:54 - Starting task 9.0 in stage 3.0 (TID 31, localhost, executor driver, partition 9, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:43:14 WARN  TaskSetManager:66 - Lost task 8.0 in stage 2.0 (TID 19, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:14 INFO  TaskSchedulerImpl:54 - Removed TaskSet 2.0, whose tasks have all completed, from pool 
2018-10-02 09:43:14 INFO  Executor:54 - Running task 9.0 in stage 3.0 (TID 31)
2018-10-02 09:43:14 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:43:15 ERROR Executor:91 - Exception in task 5.0 in stage 3.0 (TID 27)
java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:15 INFO  TaskSetManager:54 - Starting task 10.0 in stage 3.0 (TID 32, localhost, executor driver, partition 10, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:43:15 WARN  TaskSetManager:66 - Lost task 5.0 in stage 3.0 (TID 27, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2018-10-02 09:43:15 ERROR TaskSetManager:70 - Task 5 in stage 3.0 failed 1 times; aborting job
2018-10-02 09:43:15 INFO  Executor:54 - Running task 10.0 in stage 3.0 (TID 32)
2018-10-02 09:43:15 INFO  TaskSchedulerImpl:54 - Cancelling stage 3
2018-10-02 09:43:15 INFO  TaskSchedulerImpl:54 - Stage 3 was cancelled
2018-10-02 09:43:15 INFO  DAGScheduler:54 - ShuffleMapStage 3 (map at package.scala:72) failed in 8.441 s due to Job aborted due to stage failure: Task 5 in stage 3.0 failed 1 times, most recent failure: Lost task 5.0 in stage 3.0 (TID 27, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 8.0 in stage 3.0 (TID 30), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 9.0 in stage 3.0 (TID 31), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 10.0 in stage 3.0 (TID 32), reason: Stage cancelled
2018-10-02 09:43:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 2.0 in stage 3.0 (TID 24), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 6.0 in stage 3.0 (TID 28), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 3.0 in stage 3.0 (TID 25), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 0.0 in stage 3.0 (TID 22), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 7.0 in stage 3.0 (TID 29), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 4.0 in stage 3.0 (TID 26), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor is trying to kill task 1.0 in stage 3.0 (TID 23), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor killed task 10.0 in stage 3.0 (TID 32), reason: Stage cancelled
2018-10-02 09:43:15 INFO  DAGScheduler:54 - Job 2 failed: sortBy at package.scala:74, took 8.449134 s
2018-10-02 09:43:15 WARN  TaskSetManager:66 - Lost task 10.0 in stage 3.0 (TID 32, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 INFO  Executor:54 - Executor killed task 9.0 in stage 3.0 (TID 31), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor killed task 1.0 in stage 3.0 (TID 23), reason: Stage cancelled
2018-10-02 09:43:15 INFO  Executor:54 - Executor killed task 3.0 in stage 3.0 (TID 25), reason: Stage cancelled
2018-10-02 09:43:15 WARN  TaskSetManager:66 - Lost task 1.0 in stage 3.0 (TID 23, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 WARN  TaskSetManager:66 - Lost task 9.0 in stage 3.0 (TID 31, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 WARN  TaskSetManager:66 - Lost task 3.0 in stage 3.0 (TID 25, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 INFO  Executor:54 - Executor killed task 0.0 in stage 3.0 (TID 22), reason: Stage cancelled
2018-10-02 09:43:15 WARN  TaskSetManager:66 - Lost task 0.0 in stage 3.0 (TID 22, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 INFO  Executor:54 - Executor interrupted and killed task 4.0 in stage 3.0 (TID 26), reason: Stage cancelled
2018-10-02 09:43:15 WARN  TaskSetManager:66 - Lost task 4.0 in stage 3.0 (TID 26, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 INFO  Executor:54 - Executor killed task 7.0 in stage 3.0 (TID 29), reason: Stage cancelled
2018-10-02 09:43:15 WARN  TaskSetManager:66 - Lost task 7.0 in stage 3.0 (TID 29, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 3.0 failed 1 times, most recent failure: Lost task 5.0 in stage 3.0 (TID 27, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
  at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:306)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:168)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:148)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:622)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.sortBy(RDD.scala:620)
  at io.archivesunleashed.package$CountableRDD.countItems(package.scala:74)
  ... 78 elided
Caused by: java.util.zip.ZipException: invalid stored block lengths
  at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
  at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
  at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
  at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
  at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
  at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
  at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
  at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
  at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
  at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
<console>:33: error: not found: value links
       WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
                    ^
2018-10-02 09:43:15 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2018-10-02 09:43:15 INFO  AbstractConnector:318 - Stopped Spark@14b83891{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-10-02 09:43:15 INFO  SparkUI:54 - Stopped Spark web UI at http://10.0.1.44:4040
2018-10-02 09:43:15 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-10-02 09:43:15 INFO  MemoryStore:54 - MemoryStore cleared
2018-10-02 09:43:15 INFO  BlockManager:54 - BlockManager stopped
2018-10-02 09:43:15 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2018-10-02 09:43:15 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-10-02 09:43:15 INFO  SparkContext:54 - Successfully stopped SparkContext
2018-10-02 09:43:15 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-10-02 09:43:15 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-b7418656-a101-4eca-acfd-fe8914c5f1dd
2018-10-02 09:43:15 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-ea2246da-4062-4046-b819-70932ad9d664
2018-10-02 09:43:15 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-ea2246da-4062-4046-b819-70932ad9d664/repl-25ef64b4-c115-4020-a5a3-2e57854c08cc

@ruebot
Member Author

ruebot commented Oct 2, 2018

...and if I remove the empty file and run the same job with the remaining 53 problematic arcs/warcs, I am not able to replicate what you've come up with. (This is all from building aut on HEAD this morning, after clearing ~/.ivy2 and ~/.m2/repository.)
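
For reference, one way to pre-screen a directory of gzipped (w)arcs before handing them to Spark is to drain each file through a plain java.util.zip.GZIPInputStream and flag the ones that throw; the same Inflater errors seen above ("invalid stored block lengths", "too many length or distance symbols") should surface that way. A minimal sketch, not part of aut — the object name and the directory argument are just placeholders:

import java.io.{BufferedInputStream, File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.util.{Failure, Success, Try}

// Pre-screen sketch: stream every *.gz file in a directory through
// GZIPInputStream and report files whose deflate stream is corrupt.
object GzipCheck {
  def main(args: Array[String]): Unit = {
    val dir = new File(args(0))          // directory holding the (w)arc.gz files
    val buf = new Array[Byte](64 * 1024)
    dir.listFiles().filter(_.getName.endsWith(".gz")).foreach { f =>
      Try {
        val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(f)))
        try { while (in.read(buf) != -1) () } finally { in.close() }
      } match {
        case Success(_) => println(s"OK       ${f.getName}")
        case Failure(e) => println(s"CORRUPT  ${f.getName}: ${e.getMessage}")
      }
    }
  }
}

A file that trips the Inflater here is a likely candidate for the task failures in the logs below.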

2018-10-02 09:47:09 WARN  Utils:66 - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-10-02 09:47:09 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-10-02 09:47:09 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1538488032542).
Spark session available as 'spark'.
Loading /home/nruest/Dropbox/499-issues/spark-jobs/499.scala...
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
2018-10-02 09:47:14 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:47:14 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:47:14 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 10.0.1.44:40719 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:47:14 INFO  SparkContext:54 - Created broadcast 0 from newAPIHadoopFile at package.scala:51
2018-10-02 09:47:14 INFO  FileInputFormat:283 - Total input paths to process : 53
2018-10-02 09:47:14 INFO  SparkContext:54 - Starting job: sortBy at package.scala:74
2018-10-02 09:47:14 INFO  DAGScheduler:54 - Registering RDD 5 (map at package.scala:72)
2018-10-02 09:47:14 INFO  DAGScheduler:54 - Got job 0 (sortBy at package.scala:74) with 53 output partitions
2018-10-02 09:47:14 INFO  DAGScheduler:54 - Final stage: ResultStage 1 (sortBy at package.scala:74)
2018-10-02 09:47:14 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 0)
2018-10-02 09:47:14 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStage 0)
2018-10-02 09:47:14 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at map at package.scala:72), which has no missing parents
2018-10-02 09:47:14 INFO  MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 4.6 KB, free 15.8 GB)
2018-10-02 09:47:14 INFO  MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.6 KB, free 15.8 GB)
2018-10-02 09:47:14 INFO  BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on 10.0.1.44:40719 (size: 2.6 KB, free: 15.8 GB)
2018-10-02 09:47:14 INFO  SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:47:14 INFO  DAGScheduler:54 - Submitting 53 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at map at package.scala:72) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:47:14 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 53 tasks
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7976 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 8004 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:47:14 INFO  TaskSetManager:54 - Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 1.0 in stage 0.0 (TID 1)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 2.0 in stage 0.0 (TID 2)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 4.0 in stage 0.0 (TID 4)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 8.0 in stage 0.0 (TID 8)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 9.0 in stage 0.0 (TID 9)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 6.0 in stage 0.0 (TID 6)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 7.0 in stage 0.0 (TID 7)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 5.0 in stage 0.0 (TID 5)
2018-10-02 09:47:14 INFO  Executor:54 - Running task 3.0 in stage 0.0 (TID 3)
2018-10-02 09:47:14 INFO  Executor:54 - Fetching spark://10.0.1.44:33593/jars/aut-0.16.1-SNAPSHOT-fatjar.jar with timestamp 1538488032531
2018-10-02 09:47:14 INFO  TransportClientFactory:267 - Successfully created connection to /10.0.1.44:33593 after 15 ms (0 ms spent in bootstraps)
2018-10-02 09:47:14 INFO  Utils:54 - Fetching spark://10.0.1.44:33593/jars/aut-0.16.1-SNAPSHOT-fatjar.jar to /tmp/spark-0782f27c-4461-4384-9191-997eb21b1d7e/userFiles-3f5716ce-710f-4264-aff7-e0b037b9cd99/fetchFileTemp5723790853042569991.tmp
2018-10-02 09:47:15 INFO  Executor:54 - Adding file:/tmp/spark-0782f27c-4461-4384-9191-997eb21b1d7e/userFiles-3f5716ce-710f-4264-aff7-e0b037b9cd99/aut-0.16.1-SNAPSHOT-fatjar.jar to class loader
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:47:15 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:47:23 ERROR Executor:91 - Exception in task 5.0 in stage 0.0 (TID 5)
java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:23 INFO  TaskSetManager:54 - Starting task 10.0 in stage 0.0 (TID 10, localhost, executor driver, partition 10, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:47:23 WARN  TaskSetManager:66 - Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2018-10-02 09:47:23 INFO  Executor:54 - Running task 10.0 in stage 0.0 (TID 10)
2018-10-02 09:47:23 ERROR TaskSetManager:70 - Task 5 in stage 0.0 failed 1 times; aborting job
2018-10-02 09:47:23 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:47:23 INFO  TaskSchedulerImpl:54 - Cancelling stage 0
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 9.0 in stage 0.0 (TID 9), reason: Stage cancelled
2018-10-02 09:47:23 INFO  TaskSchedulerImpl:54 - Stage 0 was cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 6.0 in stage 0.0 (TID 6), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 10.0 in stage 0.0 (TID 10), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 7.0 in stage 0.0 (TID 7), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 4.0 in stage 0.0 (TID 4), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor is trying to kill task 8.0 in stage 0.0 (TID 8), reason: Stage cancelled
2018-10-02 09:47:23 INFO  DAGScheduler:54 - ShuffleMapStage 0 (map at package.scala:72) failed in 8.493 s due to Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
2018-10-02 09:47:23 INFO  Executor:54 - Executor killed task 9.0 in stage 0.0 (TID 9), reason: Stage cancelled
2018-10-02 09:47:23 INFO  Executor:54 - Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
2018-10-02 09:47:23 WARN  TaskSetManager:66 - Lost task 9.0 in stage 0.0 (TID 9, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO  DAGScheduler:54 - Job 0 failed: sortBy at package.scala:74, took 8.618676 s
2018-10-02 09:47:23 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO  Executor:54 - Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
2018-10-02 09:47:23 WARN  TaskSetManager:66 - Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO  Executor:54 - Executor killed task 4.0 in stage 0.0 (TID 4), reason: Stage cancelled
2018-10-02 09:47:23 WARN  TaskSetManager:66 - Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO  Executor:54 - Executor killed task 10.0 in stage 0.0 (TID 10), reason: Stage cancelled
2018-10-02 09:47:23 WARN  TaskSetManager:66 - Lost task 10.0 in stage 0.0 (TID 10, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO  Executor:54 - Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
2018-10-02 09:47:23 WARN  TaskSetManager:66 - Lost task 2.0 in stage 0.0 (TID 2, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO  Executor:54 - Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
2018-10-02 09:47:23 WARN  TaskSetManager:66 - Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
  at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:306)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:168)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:148)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:622)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.sortBy(RDD.scala:620)
  at io.archivesunleashed.package$CountableRDD.countItems(package.scala:74)
  ... 78 elided
Caused by: java.util.zip.ZipException: invalid stored block lengths
  at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
  at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
  at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
  at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
  at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
  at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
  at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
  at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
  at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
  at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:23 INFO  MemoryStore:54 - Block broadcast_2 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:47:23 INFO  MemoryStore:54 - Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:47:23 INFO  BlockManagerInfo:54 - Added broadcast_2_piece0 in memory on 10.0.1.44:40719 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:47:23 INFO  SparkContext:54 - Created broadcast 2 from newAPIHadoopFile at package.scala:51
2018-10-02 09:47:23 INFO  deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2018-10-02 09:47:23 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO  FileInputFormat:283 - Total input paths to process : 53
2018-10-02 09:47:23 INFO  SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2018-10-02 09:47:23 INFO  DAGScheduler:54 - Got job 1 (runJob at SparkHadoopWriter.scala:78) with 53 output partitions
2018-10-02 09:47:23 INFO  DAGScheduler:54 - Final stage: ResultStage 2 (runJob at SparkHadoopWriter.scala:78)
2018-10-02 09:47:23 INFO  DAGScheduler:54 - Parents of final stage: List()
2018-10-02 09:47:23 INFO  DAGScheduler:54 - Missing parents: List()
2018-10-02 09:47:23 INFO  DAGScheduler:54 - Submitting ResultStage 2 (MapPartitionsRDD[15] at saveAsTextFile at <console>:34), which has no missing parents
2018-10-02 09:47:23 INFO  MemoryStore:54 - Block broadcast_3 stored as values in memory (estimated size 72.3 KB, free 15.8 GB)
2018-10-02 09:47:23 INFO  MemoryStore:54 - Block broadcast_3_piece0 stored as bytes in memory (estimated size 25.9 KB, free 15.8 GB)
2018-10-02 09:47:23 INFO  BlockManagerInfo:54 - Added broadcast_3_piece0 in memory on 10.0.1.44:40719 (size: 25.9 KB, free: 15.8 GB)
2018-10-02 09:47:23 INFO  SparkContext:54 - Created broadcast 3 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:47:23 INFO  DAGScheduler:54 - Submitting 53 missing tasks from ResultStage 2 (MapPartitionsRDD[15] at saveAsTextFile at <console>:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:47:23 INFO  TaskSchedulerImpl:54 - Adding task set 2.0 with 53 tasks
2018-10-02 09:47:23 INFO  TaskSetManager:54 - Starting task 0.0 in stage 2.0 (TID 11, localhost, executor driver, partition 0, PROCESS_LOCAL, 7989 bytes)
2018-10-02 09:47:23 INFO  TaskSetManager:54 - Starting task 1.0 in stage 2.0 (TID 12, localhost, executor driver, partition 1, PROCESS_LOCAL, 8020 bytes)
2018-10-02 09:47:23 INFO  TaskSetManager:54 - Starting task 2.0 in stage 2.0 (TID 13, localhost, executor driver, partition 2, PROCESS_LOCAL, 7987 bytes)
2018-10-02 09:47:23 INFO  TaskSetManager:54 - Starting task 3.0 in stage 2.0 (TID 14, localhost, executor driver, partition 3, PROCESS_LOCAL, 7991 bytes)
2018-10-02 09:47:23 INFO  TaskSetManager:54 - Starting task 4.0 in stage 2.0 (TID 15, localhost, executor driver, partition 4, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:47:23 INFO  TaskSetManager:54 - Starting task 5.0 in stage 2.0 (TID 16, localhost, executor driver, partition 5, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:47:23 INFO  TaskSetManager:54 - Starting task 6.0 in stage 2.0 (TID 17, localhost, executor driver, partition 6, PROCESS_LOCAL, 8015 bytes)
2018-10-02 09:47:23 INFO  Executor:54 - Running task 2.0 in stage 2.0 (TID 13)
2018-10-02 09:47:23 INFO  Executor:54 - Running task 3.0 in stage 2.0 (TID 14)
2018-10-02 09:47:23 INFO  Executor:54 - Running task 4.0 in stage 2.0 (TID 15)
2018-10-02 09:47:23 INFO  Executor:54 - Running task 0.0 in stage 2.0 (TID 11)
2018-10-02 09:47:23 INFO  Executor:54 - Running task 1.0 in stage 2.0 (TID 12)
2018-10-02 09:47:23 INFO  Executor:54 - Running task 6.0 in stage 2.0 (TID 17)
2018-10-02 09:47:23 INFO  Executor:54 - Running task 5.0 in stage 2.0 (TID 16)
2018-10-02 09:47:23 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:47:23 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:47:23 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:47:23 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:47:23 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:47:23 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:47:23 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:47:23 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:25 INFO  Executor:54 - Executor killed task 8.0 in stage 0.0 (TID 8), reason: Stage cancelled
2018-10-02 09:47:25 INFO  TaskSetManager:54 - Starting task 7.0 in stage 2.0 (TID 18, localhost, executor driver, partition 7, PROCESS_LOCAL, 7989 bytes)
2018-10-02 09:47:25 WARN  TaskSetManager:66 - Lost task 8.0 in stage 0.0 (TID 8, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:25 INFO  Executor:54 - Running task 7.0 in stage 2.0 (TID 18)
2018-10-02 09:47:25 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:47:25 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:26 INFO  Executor:54 - Executor killed task 6.0 in stage 0.0 (TID 6), reason: Stage cancelled
2018-10-02 09:47:26 INFO  TaskSetManager:54 - Starting task 8.0 in stage 2.0 (TID 19, localhost, executor driver, partition 8, PROCESS_LOCAL, 8020 bytes)
2018-10-02 09:47:26 WARN  TaskSetManager:66 - Lost task 6.0 in stage 0.0 (TID 6, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:26 INFO  Executor:54 - Running task 8.0 in stage 2.0 (TID 19)
2018-10-02 09:47:26 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:47:26 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:28 INFO  Executor:54 - Executor killed task 7.0 in stage 0.0 (TID 7), reason: Stage cancelled
2018-10-02 09:47:28 INFO  TaskSetManager:54 - Starting task 9.0 in stage 2.0 (TID 20, localhost, executor driver, partition 9, PROCESS_LOCAL, 7991 bytes)
2018-10-02 09:47:28 WARN  TaskSetManager:66 - Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:28 INFO  Executor:54 - Running task 9.0 in stage 2.0 (TID 20)
2018-10-02 09:47:28 INFO  TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool 
2018-10-02 09:47:28 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:47:28 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 17
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 10
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 18
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 14
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 22
2018-10-02 09:47:29 INFO  BlockManagerInfo:54 - Removed broadcast_1_piece0 on 10.0.1.44:40719 in memory (size: 2.6 KB, free: 15.8 GB)
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 6
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 0
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 19
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 13
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 16
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 11
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 21
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned shuffle 0
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 24
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 15
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 8
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 20
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 7
2018-10-02 09:47:29 INFO  BlockManagerInfo:54 - Removed broadcast_0_piece0 on 10.0.1.44:40719 in memory (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 12
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 9
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 23
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 3
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 4
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 2
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 5
2018-10-02 09:47:29 INFO  ContextCleaner:54 - Cleaned accumulator 1
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000004_0 aborted.
2018-10-02 09:47:30 ERROR Executor:91 - Exception in task 4.0 in stage 2.0 (TID 15)
org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 10.0 in stage 2.0 (TID 21, localhost, executor driver, partition 10, PROCESS_LOCAL, 8018 bytes)
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more

2018-10-02 09:47:30 ERROR TaskSetManager:70 - Task 4 in stage 2.0 failed 1 times; aborting job
2018-10-02 09:47:30 INFO  Executor:54 - Running task 10.0 in stage 2.0 (TID 21)
2018-10-02 09:47:30 INFO  TaskSchedulerImpl:54 - Cancelling stage 2
2018-10-02 09:47:30 INFO  TaskSchedulerImpl:54 - Stage 2 was cancelled
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 1.0 in stage 2.0 (TID 12), reason: Stage cancelled
2018-10-02 09:47:30 INFO  DAGScheduler:54 - ResultStage 2 (runJob at SparkHadoopWriter.scala:78) failed in 6.467 s due to Job aborted due to stage failure: Task 4 in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more

Driver stacktrace:
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 8.0 in stage 2.0 (TID 19), reason: Stage cancelled
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 5.0 in stage 2.0 (TID 16), reason: Stage cancelled
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 2.0 in stage 2.0 (TID 13), reason: Stage cancelled
2018-10-02 09:47:30 INFO  DAGScheduler:54 - Job 1 failed: runJob at SparkHadoopWriter.scala:78, took 6.470853 s
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 9.0 in stage 2.0 (TID 20), reason: Stage cancelled
2018-10-02 09:47:30 ERROR SparkHadoopWriter:91 - Aborting job job_20181002094723_0015.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:34)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:39)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:41)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:43)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:45)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:47)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:49)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:51)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:53)
	at $line21.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:55)
	at $line21.$read$$iw$$iw$$iw$$iw.<init>(<console>:57)
	at $line21.$read$$iw$$iw$$iw.<init>(<console>:59)
	at $line21.$read$$iw$$iw.<init>(<console>:61)
	at $line21.$read$$iw.<init>(<console>:63)
	at $line21.$read.<init>(<console>:65)
	at $line21.$read$.<init>(<console>:69)
	at $line21.$read$.<clinit>(<console>)
	at $line21.$eval$.$print$lzycompute(<console>:7)
	at $line21.$eval$.$print(<console>:6)
	at $line21.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
	at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
	at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
	at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
	at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
	at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5$$anonfun$apply$6.apply(ILoop.scala:427)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5$$anonfun$apply$6.apply(ILoop.scala:423)
	at scala.reflect.io.Streamable$Chars$class.applyReader(Streamable.scala:111)
	at scala.reflect.io.File.applyReader(File.scala:50)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5.apply(ILoop.scala:423)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5.apply(ILoop.scala:423)
	at scala.tools.nsc.interpreter.ILoop.savingReplayStack(ILoop.scala:91)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1.apply(ILoop.scala:422)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1.apply(ILoop.scala:422)
	at scala.tools.nsc.interpreter.ILoop.savingReader(ILoop.scala:96)
	at scala.tools.nsc.interpreter.ILoop.interpretAllFrom(ILoop.scala:421)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$run$3$1.apply(ILoop.scala:577)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$run$3$1.apply(ILoop.scala:576)
	at scala.tools.nsc.interpreter.ILoop.withFile(ILoop.scala:570)
	at scala.tools.nsc.interpreter.ILoop.run$3(ILoop.scala:576)
	at scala.tools.nsc.interpreter.ILoop.loadCommand(ILoop.scala:583)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$standardCommands$8.apply(ILoop.scala:207)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$standardCommands$8.apply(ILoop.scala:207)
	at scala.tools.nsc.interpreter.LoopCommands$LineCmd.apply(LoopCommands.scala:62)
	at scala.tools.nsc.interpreter.ILoop.colonCommand(ILoop.scala:688)
	at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:679)
	at scala.tools.nsc.interpreter.ILoop.loadFiles(ILoop.scala:835)
	at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:111)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
	at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
	at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
	at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
	at org.apache.spark.repl.Main$.doMain(Main.scala:76)
	at org.apache.spark.repl.Main$.main(Main.scala:56)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 6.0 in stage 2.0 (TID 17), reason: Stage cancelled
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 10.0 in stage 2.0 (TID 21), reason: Stage cancelled
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 7.0 in stage 2.0 (TID 18), reason: Stage cancelled
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 3.0 in stage 2.0 (TID 14), reason: Stage cancelled
2018-10-02 09:47:30 INFO  Executor:54 - Executor is trying to kill task 0.0 in stage 2.0 (TID 11), reason: Stage cancelled
2018-10-02 09:47:30 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:30 WARN  FileUtil:187 - Failed to delete file or dir [/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary]: it still exists.
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000007_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000007_0 aborted.
2018-10-02 09:47:30 INFO  Executor:54 - Executor interrupted and killed task 7.0 in stage 2.0 (TID 18), reason: Stage cancelled
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 7.0 in stage 2.0 (TID 18, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000000_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000000_0 aborted.
2018-10-02 09:47:30 INFO  Executor:54 - Executor interrupted and killed task 0.0 in stage 2.0 (TID 11), reason: Stage cancelled
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000002_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000002_0 aborted.
2018-10-02 09:47:30 INFO  Executor:54 - Executor interrupted and killed task 2.0 in stage 2.0 (TID 13), reason: Stage cancelled
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 0.0 in stage 2.0 (TID 11, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 2.0 in stage 2.0 (TID 13, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000010_0 aborted.
2018-10-02 09:47:30 INFO  Executor:54 - Executor interrupted and killed task 10.0 in stage 2.0 (TID 21), reason: Stage cancelled
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 10.0 in stage 2.0 (TID 21, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000009_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000009_0 aborted.
2018-10-02 09:47:30 INFO  Executor:54 - Executor interrupted and killed task 9.0 in stage 2.0 (TID 20), reason: Stage cancelled
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 9.0 in stage 2.0 (TID 20, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000001_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000001_0 aborted.
2018-10-02 09:47:30 INFO  Executor:54 - Executor interrupted and killed task 1.0 in stage 2.0 (TID 12), reason: Stage cancelled
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 1.0 in stage 2.0 (TID 12, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
java.util.zip.ZipException: invalid stored block lengths
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000005_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000005_0 aborted.
2018-10-02 09:47:30 INFO  Executor:54 - Executor interrupted and killed task 5.0 in stage 2.0 (TID 16), reason: Stage cancelled
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:96)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
  ... 78 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	... 8 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
  at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
  ... 106 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
  at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
  at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
  at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
  at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
  at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
  at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
  at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
  at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
  at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
  at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
  at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
  at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
  at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
  at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
  ... 8 more
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000003_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000003_0 aborted.
2018-10-02 09:47:30 INFO  Executor:54 - Executor interrupted and killed task 3.0 in stage 2.0 (TID 14), reason: Stage cancelled
2018-10-02 09:47:30 WARN  TaskSetManager:66 - Lost task 3.0 in stage 2.0 (TID 14, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 INFO  MemoryStore:54 - Block broadcast_4 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:47:30 INFO  MemoryStore:54 - Block broadcast_4_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:47:30 INFO  BlockManagerInfo:54 - Added broadcast_4_piece0 in memory on 10.0.1.44:40719 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:47:30 INFO  SparkContext:54 - Created broadcast 4 from newAPIHadoopFile at package.scala:51
2018-10-02 09:47:30 INFO  FileInputFormat:283 - Total input paths to process : 53
2018-10-02 09:47:30 INFO  SparkContext:54 - Starting job: sortBy at package.scala:74
2018-10-02 09:47:30 INFO  DAGScheduler:54 - Registering RDD 23 (map at package.scala:72)
2018-10-02 09:47:30 INFO  DAGScheduler:54 - Got job 2 (sortBy at package.scala:74) with 53 output partitions
2018-10-02 09:47:30 INFO  DAGScheduler:54 - Final stage: ResultStage 4 (sortBy at package.scala:74)
2018-10-02 09:47:30 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 3)
2018-10-02 09:47:30 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStage 3)
2018-10-02 09:47:30 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 3 (MapPartitionsRDD[23] at map at package.scala:72), which has no missing parents
2018-10-02 09:47:30 INFO  MemoryStore:54 - Block broadcast_5 stored as values in memory (estimated size 4.7 KB, free 15.8 GB)
2018-10-02 09:47:30 INFO  MemoryStore:54 - Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.4 KB, free 15.8 GB)
2018-10-02 09:47:30 INFO  BlockManagerInfo:54 - Added broadcast_5_piece0 in memory on 10.0.1.44:40719 (size: 2.4 KB, free: 15.8 GB)
2018-10-02 09:47:30 INFO  SparkContext:54 - Created broadcast 5 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:47:30 INFO  DAGScheduler:54 - Submitting 53 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[23] at map at package.scala:72) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:47:30 INFO  TaskSchedulerImpl:54 - Adding task set 3.0 with 53 tasks
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 0.0 in stage 3.0 (TID 22, localhost, executor driver, partition 0, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 1.0 in stage 3.0 (TID 23, localhost, executor driver, partition 1, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 2.0 in stage 3.0 (TID 24, localhost, executor driver, partition 2, PROCESS_LOCAL, 7976 bytes)
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 3.0 in stage 3.0 (TID 25, localhost, executor driver, partition 3, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 4.0 in stage 3.0 (TID 26, localhost, executor driver, partition 4, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 5.0 in stage 3.0 (TID 27, localhost, executor driver, partition 5, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 6.0 in stage 3.0 (TID 28, localhost, executor driver, partition 6, PROCESS_LOCAL, 8004 bytes)
2018-10-02 09:47:30 INFO  TaskSetManager:54 - Starting task 7.0 in stage 3.0 (TID 29, localhost, executor driver, partition 7, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:47:30 INFO  Executor:54 - Running task 1.0 in stage 3.0 (TID 23)
2018-10-02 09:47:30 INFO  Executor:54 - Running task 7.0 in stage 3.0 (TID 29)
2018-10-02 09:47:30 INFO  Executor:54 - Running task 5.0 in stage 3.0 (TID 27)
2018-10-02 09:47:30 INFO  Executor:54 - Running task 6.0 in stage 3.0 (TID 28)
2018-10-02 09:47:30 INFO  Executor:54 - Running task 4.0 in stage 3.0 (TID 26)
2018-10-02 09:47:30 INFO  Executor:54 - Running task 3.0 in stage 3.0 (TID 25)
2018-10-02 09:47:30 INFO  Executor:54 - Running task 2.0 in stage 3.0 (TID 24)
2018-10-02 09:47:30 INFO  Executor:54 - Running task 0.0 in stage 3.0 (TID 22)
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:47:30 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:47:33 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:33 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000006_0
2018-10-02 09:47:33 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000006_0 aborted.
2018-10-02 09:47:33 INFO  Executor:54 - Executor interrupted and killed task 6.0 in stage 2.0 (TID 17), reason: Stage cancelled
2018-10-02 09:47:33 INFO  TaskSetManager:54 - Starting task 8.0 in stage 3.0 (TID 30, localhost, executor driver, partition 8, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:47:33 WARN  TaskSetManager:66 - Lost task 6.0 in stage 2.0 (TID 17, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:33 INFO  Executor:54 - Running task 8.0 in stage 3.0 (TID 30)
2018-10-02 09:47:33 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:47:34 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:34 WARN  FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000008_0
2018-10-02 09:47:34 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000008_0 aborted.
2018-10-02 09:47:34 INFO  Executor:54 - Executor interrupted and killed task 8.0 in stage 2.0 (TID 19), reason: Stage cancelled
2018-10-02 09:47:34 INFO  TaskSetManager:54 - Starting task 9.0 in stage 3.0 (TID 31, localhost, executor driver, partition 9, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:47:34 INFO  Executor:54 - Running task 9.0 in stage 3.0 (TID 31)
2018-10-02 09:47:34 WARN  TaskSetManager:66 - Lost task 8.0 in stage 2.0 (TID 19, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:34 INFO  TaskSchedulerImpl:54 - Removed TaskSet 2.0, whose tasks have all completed, from pool 
2018-10-02 09:47:34 INFO  NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:47:36 ERROR Executor:91 - Exception in task 4.0 in stage 3.0 (TID 26)
java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:36 INFO  TaskSetManager:54 - Starting task 10.0 in stage 3.0 (TID 32, localhost, executor driver, partition 10, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:47:36 WARN  TaskSetManager:66 - Lost task 4.0 in stage 3.0 (TID 26, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2018-10-02 09:47:36 ERROR TaskSetManager:70 - Task 4 in stage 3.0 failed 1 times; aborting job
2018-10-02 09:47:36 INFO  TaskSchedulerImpl:54 - Cancelling stage 3
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 8.0 in stage 3.0 (TID 30), reason: Stage cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 5.0 in stage 3.0 (TID 27), reason: Stage cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 9.0 in stage 3.0 (TID 31), reason: Stage cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Running task 10.0 in stage 3.0 (TID 32)
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 10.0 in stage 3.0 (TID 32), reason: Stage cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 2.0 in stage 3.0 (TID 24), reason: Stage cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 6.0 in stage 3.0 (TID 28), reason: Stage cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 3.0 in stage 3.0 (TID 25), reason: Stage cancelled
2018-10-02 09:47:36 INFO  TaskSchedulerImpl:54 - Stage 3 was cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 0.0 in stage 3.0 (TID 22), reason: Stage cancelled
2018-10-02 09:47:36 INFO  DAGScheduler:54 - ShuffleMapStage 3 (map at package.scala:72) failed in 6.424 s due to Job aborted due to stage failure: Task 4 in stage 3.0 failed 1 times, most recent failure: Lost task 4.0 in stage 3.0 (TID 26, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 7.0 in stage 3.0 (TID 29), reason: Stage cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Executor is trying to kill task 1.0 in stage 3.0 (TID 23), reason: Stage cancelled
2018-10-02 09:47:36 INFO  Executor:54 - Executor killed task 10.0 in stage 3.0 (TID 32), reason: Stage cancelled
2018-10-02 09:47:36 INFO  DAGScheduler:54 - Job 2 failed: sortBy at package.scala:74, took 6.428172 s
2018-10-02 09:47:36 WARN  TaskSetManager:66 - Lost task 10.0 in stage 3.0 (TID 32, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:36 INFO  Executor:54 - Executor killed task 7.0 in stage 3.0 (TID 29), reason: Stage cancelled
2018-10-02 09:47:36 WARN  TaskSetManager:66 - Lost task 7.0 in stage 3.0 (TID 29, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:36 INFO  Executor:54 - Executor killed task 0.0 in stage 3.0 (TID 22), reason: Stage cancelled
2018-10-02 09:47:36 WARN  TaskSetManager:66 - Lost task 0.0 in stage 3.0 (TID 22, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:37 INFO  Executor:54 - Executor killed task 1.0 in stage 3.0 (TID 23), reason: Stage cancelled
2018-10-02 09:47:37 WARN  TaskSetManager:66 - Lost task 1.0 in stage 3.0 (TID 23, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 3.0 failed 1 times, most recent failure: Lost task 4.0 in stage 3.0 (TID 26, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
  at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:306)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:168)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:148)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:622)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.sortBy(RDD.scala:620)
  at io.archivesunleashed.package$CountableRDD.countItems(package.scala:74)
  ... 78 elided
Caused by: java.util.zip.ZipException: invalid code lengths set
  at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
  at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
  at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
  at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
  at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
  at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
  at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
  at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
  at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
  at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
<console>:33: error: not found: value links
       WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
                    ^
2018-10-02 09:47:37 INFO  Executor:54 - Executor killed task 2.0 in stage 3.0 (TID 24), reason: Stage cancelled
2018-10-02 09:47:37 WARN  TaskSetManager:66 - Lost task 2.0 in stage 3.0 (TID 24, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:37 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2018-10-02 09:47:37 INFO  Executor:54 - Executor killed task 3.0 in stage 3.0 (TID 25), reason: Stage cancelled
2018-10-02 09:47:37 WARN  TaskSetManager:66 - Lost task 3.0 in stage 3.0 (TID 25, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:37 INFO  AbstractConnector:318 - Stopped Spark@26c2bd0c{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-10-02 09:47:37 INFO  SparkUI:54 - Stopped Spark web UI at http://10.0.1.44:4040
2018-10-02 09:47:37 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-10-02 09:47:37 INFO  MemoryStore:54 - MemoryStore cleared
2018-10-02 09:47:37 INFO  BlockManager:54 - BlockManager stopped
2018-10-02 09:47:37 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2018-10-02 09:47:37 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-10-02 09:47:37 INFO  SparkContext:54 - Successfully stopped SparkContext
2018-10-02 09:47:37 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-10-02 09:47:37 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-4cdcdab3-2015-4789-acbc-c3ef56e6e405
2018-10-02 09:47:37 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-0782f27c-4461-4384-9191-997eb21b1d7e/repl-6a0a3b1f-9a55-4268-a6d8-d76accaa0494
2018-10-02 09:47:37 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-0782f27c-4461-4384-9191-997eb21b1d7e

@ruebot
Member Author

ruebot commented Oct 2, 2018

Out of curiosity, I tried it with Apache Spark 2.3.2 (released September 24, 2018), and I'm getting the same thing.

@borislin
Collaborator

borislin commented Oct 2, 2018

@ruebot

I've tried again and it works on my end. Here is the log.

@ruebot
Member Author

ruebot commented Oct 2, 2018

Can you give all your steps? I'm unable to replicate your success on 0.16.0 or on HEAD, and I'm fairly sure @ianmilligan1 isn't either. When we're verifying something like this, sharing the exact steps is extremely helpful; "it works on my end" without steps to replicate isn't much to go on. So, can you please do the following on tuna, or whichever machine you have these files on:

  1. Clean up your environment:
  • Remove everything in ~/.m2 and ~/.ivy2
  2. Remove aut from wherever you have it.
  3. Clone aut somewhere.
  4. Build aut on master, as of the latest commit: mvn clean install
  5. Create an output directory with sub-directories:
  • mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs
  6. Adapt the example script from above:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-domains/output")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-text/output")
val links = RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
sys.exit
  7. Run the command from above, with adapted paths, against Apache Spark 2.1.3 or Apache Spark 2.3.2:
/home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
  8. Let us know what happened, tell us your steps, and share the output of the log. (A quick diagnostic sketch for isolating the corrupt archives follows below.)
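
Not something from the steps above, just a possible shortcut for narrowing things down: the sketch below (plain Scala, paths are placeholders, adjust to your collection) streams every .gz in the collection directory through the stock JDK GZIPInputStream and prints the files that fail to decompress. It won't behave identically to the Heritrix GZIPMembersInputStream that aut uses, but the files it flags should roughly match the ones throwing the ZipExceptions in the logs, which makes it easier to pull the problematic ARCs/WARCs out of the 53-file set before re-running the job.

// Hypothetical diagnostic, not part of aut: flag the .gz files that fail to decompress cleanly.
import java.io.{BufferedInputStream, File, FileInputStream, IOException}
import java.util.zip.GZIPInputStream

val collectionDir = new File("/home/nruest/Dropbox/499-issues") // placeholder, adjust to your collection
val buf = new Array[Byte](64 * 1024)

Option(collectionDir.listFiles).getOrElse(Array.empty[File])
  .filter(_.getName.endsWith(".gz"))
  .sortBy(_.getName)
  .foreach { f =>
    try {
      // GZIPInputStream reads concatenated gzip members, which is how ARC/WARC .gz
      // files are laid out, so draining the stream touches every record.
      val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(f)))
      try {
        while (in.read(buf) != -1) {} // corruption surfaces here as a ZipException/EOFException
        println(s"OK      ${f.getName}")
      } finally {
        in.close()
      }
    } catch {
      case e: IOException =>
        println(s"BAD     ${f.getName} -> ${e.getClass.getSimpleName}: ${e.getMessage}")
    }
  }

It only needs the JDK and the Scala standard library, so pasting it into the same spark-shell (or a standalone Scala REPL) is enough.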

@ianmilligan1
Member

Thx @ruebot & @borislin – aye, I'm not able to make this work without it crashing. Thanks again for looking into this.
