AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
Comments
Same thing on text extraction now, at the same point (same file):
Thanks for the update @ruebot! Could you put the file on rho? I'd love to poke at it tomorrow afternoon.
Individually, the WARC is fine; I've been able to extract domains and plain text from it. Hmm.
I see in the second failed merge that it's actually a different file:
@ruebot Could you help put the file on tuna? I'm looking into this issue.
@borislin I put […] @ruebot, if you have the file handy, could you move […] That said, with […] As noted above, it's similar to #246, we think.
I'm moving all the problem files over right now. It's taking some time, and I'll let y'all know when I'm done. |
There are 54 files there. They are a mix of files that include:
What are the known files that are causing this ZipException issue? Only […]
Sounds right to me. Those are the ones in the error log. There is a link to a gist above too if you want to double check. |
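The per-file integrity check being discussed (the "file: OK" results quoted elsewhere in this thread) can be sketched with `gunzip -t`, which test-decompresses a gzip member without writing output. The files below are stand-ins created for the demonstration, not archives from this issue:

```shell
# Create one valid gzip member and one deliberately truncated copy.
printf 'hello warc\n' | gzip > good.gz
head -c 20 good.gz > truncated.gz

# gunzip -t exits nonzero on corrupt or truncated DEFLATE data,
# so the loop prints an OK/CORRUPT verdict per archive.
for f in good.gz truncated.gz; do
  if gunzip -t "$f" 2>/dev/null; then
    echo "$f: OK"
  else
    echo "$f: CORRUPT"
  fi
done
```

A file that passes `gunzip -t` can still make Spark unhappy for other reasons, but a failure here confirms the archive itself is damaged.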
I also can't reproduce the error for […]. For […], the command I've used is:
@ruebot HEAD
Can you do the same testing on all the files, with HEAD, but use all the files in […]?
@ruebot I've done the same testing on all the files in […].
Thanks! I'll carve out some time today and try to replicate.
@borislin can you gist up your output log? This is what I just ran on my end with all the 499-issues arcs/warcs:
(I use zsh, so I have to escape those brackets.)
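The bracket-escaping aside refers to zsh globbing: unquoted `[12]` is a character class, so zsh aborts with "no matches found" before Spark even starts. Quoting the master URL (shown below with a placeholder variable) avoids the escaping entirely and works in any shell:

```shell
# zsh treats [ and ] as glob characters, so an unquoted local[12]
# fails with: zsh: no matches found: local[12]
# Backslash-escaping (local\[12\]) or quoting both prevent expansion.
master='local[12]'   # quoted: passed to spark-shell literally
echo "$master"       # prints local[12]
```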
...and if I remove the empty file and run the same job with the other 53 problematic arcs/warcs, I am not able to replicate what you've come up with (this is all from building aut on HEAD this morning, after clearing […]).
Out of curiosity, I tried it with Apache Spark 2.3.2 (released September 24, 2018), and I'm getting the same thing.
Can you give all your steps? I am unable to replicate your success on 0.16.0 and HEAD, and I am sure @ianmilligan1 is as well. It's extremely helpful to share the exact steps when we're verifying something like this; saying "it works on my end" without steps to replicate isn't very helpful. So, can you please do this on tuna, or whatever machine you have these files on:
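For isolating which archive fails, one hypothetical approach (not the exact steps used in this thread; `/data/499`, `job.scala`, and the `spark.driver.args` conf key are placeholders I'm assuming for illustration) is to run the job once per file, so any ZipException lands in a single file's log:

```shell
# Hypothetical bisection sketch: run the same aut job once per archive.
# Arbitrary spark.* conf keys are accepted by spark-shell, so the file
# path can be smuggled to the -i script through one of them.
for f in /data/499/*.gz; do
  spark-shell --master 'local[12]' \
    --packages "io.archivesunleashed:aut:0.16.0" \
    --conf spark.driver.args="$f" \
    -i job.scala 2>&1 | tee "$(basename "$f").log"
done
```

Inside `job.scala` the path could then be read back with `sc.getConf.get("spark.driver.args")`; the first log containing the exception pinpoints the offending archive.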
Describe the bug
Came across this when processing a user's collection on cloud.archivesunleashed.org, using aut-0.16.0. The collection appears to have a couple of problematic WARCs, which throw this error:
Fuller log output available here.
To Reproduce
The error occurs at this specific point:
Environment information
Spark invocation (via --packages):
/home/ubuntu/aut/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[12] --driver-memory 105G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --packages "io.archivesunleashed:aut:0.16.0" -i /data/139/499/60/spark_jobs/499.scala | tee /data/139/499/60/spark_jobs/499.scala.log
Additional context
ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz is on rho or tuna for further testing.
ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz: OK