java.lang.OutOfMemoryError: Java heap space #246
Does your application work if you launch it via spark-shell? To launch:

```
/home/i2millig/spark-1.5.1/bin/spark-shell --driver-memory 6G --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
```

And does this simple script work?

```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```

Just curious where the error is.
Yes, this works and gives the expected result. Whether the example I gave works depends on the input file.
Yes, that matches my understanding. We've tested on WARC files up to ~1.1 GB, but not much bigger. Are you using a very big WARC file? Error traces are always useful, as we're finding lots of weird things in WARC files that can occasionally break warcbase!
It seems that it does not depend on the size of the WARC file: I can successfully process 1 GB CommonCrawl files and they work fine. Here is a sample 295 MB WARC file for which the script fails: https://www.dropbox.com/s/h7ing7wdgdq1x9u/www.swisslog.com.warc.gz?dl=0 Alternatively, you can rebuild this WARC archive yourself.
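The exact rebuild command did not survive in this thread. As an illustration only (hypothetical flags, not the reporter's actual command), a WARC like this one can be produced with wget's built-in WARC output:

```
# Hypothetical reconstruction: crawl the site and write the responses to a
# WARC archive. --warc-file gives the output prefix, so this produces
# www.swisslog.com.warc.gz alongside the downloaded files.
wget --recursive --level=1 --warc-file=www.swisslog.com https://www.swisslog.com/
```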
We do have a problem with large WARC files, which I'll follow up on in #254. This is weird, @dportabella – when you rebuild the WARC archive, does it work? Thanks!
I didn't understand the point of rebuilding the WARC archive and trying again (what insight would the result give?). Anyway, I tried it, and it failed with the same error. However, while re-reading my own description, I noticed that I had only set --executor-memory, whereas your spark-shell command sets --driver-memory. The input file is only 295 MB, so how can I know how much driver memory is needed? Anyway, it was my mistake not to use --driver-memory.
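One way to check how much heap the driver JVM actually received (a minimal sketch, assuming Spark 1.5.x, where local-mode tasks run inside the driver JVM) is to query the runtime from within spark-shell:

```scala
// Maximum heap available to this JVM, in megabytes. With --master local[*],
// tasks run inside the driver JVM, so this is the heap --driver-memory controls.
val maxHeapMb = Runtime.getRuntime.maxMemory / (1024L * 1024L)
println(s"driver max heap: $maxHeapMb MB")
```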
I had memory problems running my program, and I see that I cannot even run a very simple example. I am running this on my local machine:

```
$ spark-submit --executor-memory 6g --master local[2] --class application.Test target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar /data/sample.warc.gz
```

and I get:

```
java.lang.OutOfMemoryError: Java heap space
```

I thought that Spark would take care of the memory, spilling to disk when necessary, and run fine as long as there is enough disk space (even if the input file is 1 petabyte). Why do I get an OutOfMemoryError? Does RecordLoader.loadArchives load everything into memory? How can I solve this problem?
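For what it's worth, in Spark's local mode the executors run inside the driver JVM, so --executor-memory has no effect there; the heap has to be raised with --driver-memory instead, which is the conclusion the comments above reach. A minimal sketch of the corrected launch, reusing the same jar and input paths:

```
$ spark-submit --driver-memory 6g --master local[2] --class application.Test target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar /data/sample.warc.gz
```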