This repository has been archived by the owner on Apr 27, 2018. It is now read-only.

java.lang.OutOfMemoryError: Java heap space #246

Open
dportabella opened this issue Sep 23, 2016 · 6 comments

Comments

@dportabella

I ran into memory problems with my program,
and I find that I cannot even run this very simple example:

package application

import org.apache.spark._
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

object Test {
  def main(args: Array[String]): Unit = {
    val in = if (args.length > 0) args(0) else "/data/sample.warc.gz"
    val conf = new SparkConf().setAppName("Test")
    val spark = new SparkContext(conf)

    val r = RecordLoader.loadArchives(in, spark)
      .keepValidPages()
      .count()

    println(s"result: $r")

    spark.stop()
  }
}

I am running this on my local machine:

$ spark-submit --executor-memory 6g --master local[2] --class application.Test target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar /data/sample.warc.gz

and I get a java.lang.OutOfMemoryError: Java heap space

I thought that Spark would take care of the memory, spilling to disk when necessary, and run fine as long as there is enough disk space (even if the input file is 1 petabyte).
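As a side note, Spark only spills an RDD's partitions to disk when the RDD is persisted with a disk-backed storage level, and even then each individual record still has to fit in the heap while it is deserialized. A minimal sketch using the standard RDD API (assuming the warcbase implicits still apply to the persisted RDD):

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK lets Spark spill cached partitions that do not fit in
// the heap instead of failing; it does not help with a single huge record.
val records = RecordLoader.loadArchives("/data/sample.warc.gz", spark)
  .persist(StorageLevel.MEMORY_AND_DISK)

println(s"result: ${records.keepValidPages().count()}")
```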

Why do I get an OutOfMemoryError?

Does RecordLoader.loadArchives load everything in memory?
How can I solve this problem?

@ianmilligan1
Collaborator

Does your application work if you launch it via spark-shell on a local machine, along the lines of:

/home/i2millig/spark-1.5.1/bin/spark-shell --driver-memory 6G --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar

and then use :paste to enter your script?

And does this simple script work (with the loadArchives path changed accordingly):

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Just curious where the error is.

@dportabella
Author

Yes, this works, and gives this result:
r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))

Does RecordLoader.loadArchives load everything in memory?

Whether the example I gave works depends on the input file /data/sample.warc.gz. I'll try to find the simplest input file for which it fails.

@ianmilligan1
Collaborator

Yes, my understanding is that loadArchives loads everything in memory – down the road, I think we'd like to explore using CDX files to be a bit more selective (i.e. the ArchiveSpark model).

We've tested on WARC files up to ~ 1.1GB, but not much bigger. Are you using a very big WARC file?

Error traces are always useful, as we're finding lots of weird things in WARC files that can occasionally break warcbase!

@dportabella
Author

It seems that it does not depend on the size of the WARC file: I can successfully process 1 GB CommonCrawl files. Here is a 295 MB sample WARC file for which loadArchives fails with OutOfMemoryError:

https://www.dropbox.com/s/h7ing7wdgdq1x9u/www.swisslog.com.warc.gz?dl=0

Or you can rebuild this WARC archive with:
wget --warc-file=www.swisslog.com --warc-max-size=500M --no-check-certificate --recursive --level=4 --reject pdf,gz,tar,zip,gif,js,css,ico,jpg,jpeg,png,tiff,mp3,mp4,mpg,mpeg,avi,rfa http://www.swisslog.com/

@ianmilligan1
Collaborator

We do have problems with large WARC files, which I'll follow up on in #254.

This is weird, @dportabella – when you rebuild the WARC archive, does it work? Thanks!

@dportabella
Author

I didn't understand the point of rebuilding the WARC archive and trying again (what insight would the result give?).

Anyway, I tried and it failed with the same error.

However, while reading my own description, I noticed that I only used --executor-memory 6g.
I tried again using --driver-memory 6G, and this time the execution succeeded.

The input www.swisslog.com-00000.warc.gz is 265M, and uncompressed is 346M. I tried again with --driver-memory 1G and it failed again with same error: OutOfMemoryError: Java heap space.

How can I know how much driver memory and executor memory I need?

Anyway, it was my mistake that I didn't use --driver-memory. We can close this ticket.
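For reference: with --master local[N] the executors run inside the driver JVM, so the heap is governed by --driver-memory; --executor-memory only takes effect on an actual cluster. A working invocation of the original example (same paths as above) would be something like:

```shell
# In local mode executors share the driver JVM, so size the driver heap:
spark-submit --driver-memory 6g --master local[2] \
  --class application.Test \
  target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar /data/sample.warc.gz
```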
