This repository has been archived by the owner on Apr 27, 2018. It is now read-only.

java.lang.OutOfMemoryError: Java heap space #246

Open
dportabella opened this issue Sep 23, 2016 · 6 comments

Comments

@dportabella

I ran into memory problems with my program,
and I find that I cannot even run this very simple example:

package application

import org.apache.spark._
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

object Test {
  def main(args: Array[String]): Unit = {
    val in = if (args.length > 0) args(0) else "/data/sample.warc.gz"
    val conf = new SparkConf().setAppName("Test")
    val spark = new SparkContext(conf)

    val r = RecordLoader.loadArchives(in, spark)
      .keepValidPages()
      .count()

    println(s"result: $r")

    spark.stop()
  }
}

I am running this on my local machine:

$ spark-submit --executor-memory 6g --master local[2] --class application.Test target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar /data/sample.warc.gz

and I get a java.lang.OutOfMemoryError: Java heap space

I thought that Spark would take care of the memory, spilling to disk when necessary, and run fine as long as there is enough disk space (even if the input file is 1 petabyte).
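As a side note, Spark only spills an RDD's partitions to disk when the RDD is persisted with a disk-backed storage level, and even then each individual record still has to fit in the heap while it is deserialized. A minimal sketch using the standard RDD API (assuming the warcbase implicits still apply to the persisted RDD):

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK lets Spark spill cached partitions that do not fit in
// the heap instead of failing; it does not help with a single huge record.
val records = RecordLoader.loadArchives("/data/sample.warc.gz", spark)
  .persist(StorageLevel.MEMORY_AND_DISK)

println(s"result: ${records.keepValidPages().count()}")
```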

Why do I get an OutOfMemoryError?

Does RecordLoader.loadArchives load everything in memory?
How can I solve this problem?

@ianmilligan1
Collaborator

Does your application work if you launch it via spark-shell on a local machine, along the lines of:

/home/i2millig/spark-1.5.1/bin/spark-shell --driver-memory 6G --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar

and then use :paste to enter your script?

And does this simple script work (with the loadArchives path changed accordingly):

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Just curious where the error is.

@dportabella
Author

Yes, this works, and gives this result:
r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))

Does RecordLoader.loadArchives load everything in memory?

Whether the example I gave works depends on the input file /data/sample.warc.gz. I'll try to find the simplest input file for which it fails.

@ianmilligan1
Collaborator

Yes, my understanding is that loadArchives loads everything in memory – down the road, I think we'd like to explore using CDX files to be a bit more selective (i.e. the ArchiveSpark model).

We've tested on WARC files up to ~ 1.1GB, but not much bigger. Are you using a very big WARC file?

Error traces are always useful, as we're finding lots of weird things in WARC files that can occasionally break warcbase!

@dportabella
Author

It seems that it does not depend on the size of the WARC file: I can successfully process 1 GB CommonCrawl files. Here is a 295 MB sample WARC file for which loadArchives fails with OutOfMemoryError:

https://www.dropbox.com/s/h7ing7wdgdq1x9u/www.swisslog.com.warc.gz?dl=0

Or you can rebuild this WARC archive with:
wget --warc-file=www.swisslog.com --warc-max-size=500M --no-check-certificate --recursive --level=4 --reject pdf,gz,tar,zip,gif,js,css,ico,jpg,jpeg,png,tiff,mp3,mp4,mpg,mpeg,avi,rfa http://www.swisslog.com/

@ianmilligan1
Collaborator

We do have problems with large WARC files, which I'll follow up on in #254.

This is weird, @dportabella – when you rebuild the WARC archive, does it work? Thanks!

@dportabella
Author

I didn't understand the point of rebuilding the WARC archive and trying again (what insight would the result give?).

Anyway, I tried and it failed with the same error.

However, while reading my own description, I noticed that I only used --executor-memory 6g.
I tried again using --driver-memory 6G, and this time the execution succeeded.

The input www.swisslog.com-00000.warc.gz is 265M, and uncompressed is 346M. I tried again with --driver-memory 1G and it failed again with same error: OutOfMemoryError: Java heap space.

How can I know how much driver memory and executor memory I need?

Anyway, it was my mistake that I didn't use --driver-memory. We can close this ticket.
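For reference: with --master local[N] the executors run inside the driver JVM, so the heap is governed by --driver-memory; --executor-memory only takes effect on an actual cluster. A working invocation of the original example (same paths as above) would be something like:

```shell
# In local mode executors share the driver JVM, so size the driver heap:
spark-submit --driver-memory 6g --master local[2] \
  --class application.Test \
  target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar /data/sample.warc.gz
```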
