Describe the bug
Came across this when processing a user's collection on cloud.archivesunleashed.org, using aut-0.16.0. The collection appears to have a couple of problematic WARCs, which throw this error:
2018-07-19 00:48:39,021 [Executor task launch worker for task 5771] INFO NewHadoopRDD - Input split: file:/data/146/625/warcs/ARCHIVEIT-625-20090319153934-00276-crawling04.us.archive.org.arc.gz:0+103342436
2018-07-19 00:48:40,484 [Executor task launch worker for task 5770] ERROR Executor - Exception in task 1922.0 in stage 3.0 (TID 5770)
java.util.zip.ZipException: invalid distance too far back
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-07-19 00:48:40,485 [dispatcher-event-loop-2] INFO TaskSetManager - Starting task 1924.0 in stage 3.0 (TID 5772, localhost, executor driver, partition 1924, PROCESS_LOCAL, 19609 bytes)
2018-07-19 00:48:40,485 [Executor task launch worker for task 5772] INFO Executor - Running task 1924.0 in stage 3.0 (TID 5772)
2018-07-19 00:48:40,486 [task-result-getter-0] WARN TaskSetManager - Lost task 1922.0 in stage 3.0 (TID 5770, localhost, executor driver): java.util.zip.ZipException: invalid distance too far back
    [same java.util.zip.ZipException stack trace as above]
2018-07-19 00:48:40,486 [task-result-getter-0] ERROR TaskSetManager - Task 1922 in stage 3.0 failed 1 times; aborting job
2018-07-19 00:48:40,486 [dag-scheduler-event-loop] INFO TaskSchedulerImpl - Cancelling stage 3
2018-07-19 00:48:40,487 [dag-scheduler-event-loop] INFO TaskSchedulerImpl - Stage 3 was cancelled
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1924.0 in stage 3.0 (TID 5772)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1921.0 in stage 3.0 (TID 5769)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1914.0 in stage 3.0 (TID 5762)
2018-07-19 00:48:40,487 [dag-scheduler-event-loop] INFO DAGScheduler - ShuffleMapStage 3 (map at package.scala:66) failed in 6445.786 s due to Job aborted due to stage failure: Task 1922 in stage 3.0 failed 1 times, most recent failure: Lost task 1922.0 in stage 3.0 (TID 5770, localhost, executor driver): java.util.zip.ZipException: invalid distance too far back
    [same java.util.zip.ZipException stack trace as above]
Driver stacktrace:
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1893.0 in stage 3.0 (TID 5741)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1918.0 in stage 3.0 (TID 5766)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1915.0 in stage 3.0 (TID 5763)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1912.0 in stage 3.0 (TID 5760)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1919.0 in stage 3.0 (TID 5767)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1916.0 in stage 3.0 (TID 5764)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1872.0 in stage 3.0 (TID 5720)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1923.0 in stage 3.0 (TID 5771)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO Executor - Executor is trying to kill task 1917.0 in stage 3.0 (TID 5765)
2018-07-19 00:48:40,487 [main] INFO DAGScheduler - Job 2 failed: sortBy at package.scala:68, took 6445.867506 s
2018-07-19 00:48:40,488 [Executor task launch worker for task 5772] INFO NewHadoopRDD - Input split: file:/data/146/625/warcs/ARCHIVEIT-625-20090319170447-00329-crawling04.us.archive.org.warc.gz:0+100053886
2018-07-19 00:48:40,489 [Executor task launch worker for task 5763] INFO Executor - Executor killed task 1915.0 in stage 3.0 (TID 5763)
2018-07-19 00:48:40,490 [task-result-getter-2] WARN TaskSetManager - Lost task 1915.0 in stage 3.0 (TID 5763, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,490 [Executor task launch worker for task 5764] INFO Executor - Executor killed task 1916.0 in stage 3.0 (TID 5764)
2018-07-19 00:48:40,490 [task-result-getter-1] WARN TaskSetManager - Lost task 1916.0 in stage 3.0 (TID 5764, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,491 [Executor task launch worker for task 5765] INFO Executor - Executor killed task 1917.0 in stage 3.0 (TID 5765)
2018-07-19 00:48:40,491 [task-result-getter-3] WARN TaskSetManager - Lost task 1917.0 in stage 3.0 (TID 5765, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,508 [Executor task launch worker for task 5720] INFO Executor - Executor killed task 1872.0 in stage 3.0 (TID 5720)
2018-07-19 00:48:40,509 [task-result-getter-0] WARN TaskSetManager - Lost task 1872.0 in stage 3.0 (TID 5720, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,517 [Executor task launch worker for task 5772] INFO Executor - Executor killed task 1924.0 in stage 3.0 (TID 5772)
2018-07-19 00:48:40,518 [task-result-getter-2] WARN TaskSetManager - Lost task 1924.0 in stage 3.0 (TID 5772, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,532 [Executor task launch worker for task 5771] INFO Executor - Executor killed task 1923.0 in stage 3.0 (TID 5771)
2018-07-19 00:48:40,532 [task-result-getter-1] WARN TaskSetManager - Lost task 1923.0 in stage 3.0 (TID 5771, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,716 [Executor task launch worker for task 5767] INFO Executor - Executor killed task 1919.0 in stage 3.0 (TID 5767)
2018-07-19 00:48:40,720 [task-result-getter-3] WARN TaskSetManager - Lost task 1919.0 in stage 3.0 (TID 5767, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1922 in stage 3.0 failed 1 times, most recent failure: Lost task 1922.0 in stage 3.0 (TID 5770, localhost, executor driver): java.util.zip.ZipException: invalid distance too far back
    [same java.util.zip.ZipException stack trace as above]
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:266)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:128)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:619)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:620)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.sortBy(RDD.scala:617)
at io.archivesunleashed.package$CountableRDD.countItems(package.scala:68)
... 77 elided
Caused by: java.util.zip.ZipException: invalid distance too far back
    [same java.util.zip.ZipException stack trace as above]
2018-07-19 00:48:41,130 [Executor task launch worker for task 5766] INFO Executor - Executor killed task 1918.0 in stage 3.0 (TID 5766)
2018-07-19 00:48:41,131 [task-result-getter-0] WARN TaskSetManager - Lost task 1918.0 in stage 3.0 (TID 5766, localhost, executor driver): TaskKilled (killed intentionally)
<console>:33: error: not found: value links
WriteGraphML(links, "/data/146/625/45/derivatives/gephi/625-gephi.graphml")
^
2018-07-19 00:48:41,622 [Thread-1] INFO SparkContext - Invoking stop() from shutdown hook
2018-07-19 00:48:41,637 [Thread-1] INFO ServerConnector - Stopped Spark@5625daf1{HTTP/1.1}{0.0.0.0:4040}
To Reproduce
Steps to reproduce the behavior:
Some more context from GitHub digging: the java.util.zip.ZipException: invalid distance code error was fixed for WarcRecordUtils.java in this commit. Here is the original issue, from back when AUT was Warcbase.
However, we never updated ArcRecordUtils.java to introduce similar error handling for ARC files. It would be great if ArcRecordUtils.java were updated to catch this error.
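For reference, a minimal sketch of what that catch could look like on the ARC side. This is a hypothetical illustration, not ArcRecordUtils' real code: the class and method names are invented, and the only grounded detail is the IOUtils.toByteArray call visible in the trace above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Logger;
import java.util.zip.ZipException;

import org.apache.commons.io.IOUtils;

// Hypothetical sketch only -- not AUT's actual code. A copyToByteArray-style
// helper that treats a corrupt gzip member ("invalid distance too far back")
// as a bad record instead of a fatal error, mirroring the WarcRecordUtils fix.
public final class SafeArcRead {
  private static final Logger LOG = Logger.getLogger(SafeArcRead.class.getName());

  public static byte[] toByteArraySafe(InputStream in) throws IOException {
    try {
      // Same call that fails in the trace above.
      return IOUtils.toByteArray(in);
    } catch (ZipException e) {
      // Log and hand back an empty body so the caller can skip this record.
      LOG.severe("Corrupt gzip member, returning empty record body: " + e.getMessage());
      return new byte[0];
    }
  }

  private SafeArcRead() {}
}
```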
Possibly related: #249 removed some try-catch calls checking for IOExceptions, favoring an Option approach. The justification (discussed in #212) was that IOExceptions should be caught in the ArchiveRecord class instead of being managed inside every string manipulation function.
However, this is a ZipException, so I do not think the problems are the same. (You can see in #249 that I avoided adding Options to the ArchiveRecord class because it would have required refactoring.)
Expected behavior
I think we should catch this error, log it, and move on.
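A minimal sketch of that "catch, log, and move on" behavior, assuming an invented RecordSource interface (AUT really iterates records through Spark's RecordLoader; nothing below is its actual API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipException;

// Hypothetical illustration: a corrupt record's ZipException is logged and
// the loop continues with the next record instead of aborting the whole job.
public final class LenientArchiveLoop {
  interface RecordSource {
    InputStream nextRecordStream() throws IOException; // null when exhausted
  }

  static void processAll(RecordSource source) throws IOException {
    int skipped = 0;
    InputStream record;
    while ((record = source.nextRecordStream()) != null) {
      try {
        process(record);
      } catch (ZipException e) {
        // Log it and move on to the next record.
        skipped++;
        System.err.println("Skipping corrupt record: " + e.getMessage());
      }
    }
    System.err.println("Finished; skipped " + skipped + " corrupt record(s).");
  }

  private static void process(InputStream in) throws IOException {
    while (in.read() != -1) { /* consume; real processing would go here */ }
  }
}
```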
Additional context
I'll check with the user and see whether it is OK to use one of these files in a test.
tag: @lintool @ianmilligan1