Compresses factor graph binary files with bzip2 #450

Merged

feiranwang merged 3 commits into master from compressed-factorgraph-binaries on Jan 26, 2016

Conversation

@netj (Contributor) commented Jan 5, 2016

When grounding, the factor graph binary files are compressed with bzip2, and they are decompressed on the fly with bzcat when running the sampler.
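
Conceptually the data path looks like the following sketch (hypothetical command names; `dump_factors` and `run_sampler` merely stand in for the actual grounding and sampler invocations, which this PR wires up internally):

```bash
# Grounding side: compress the dumped factor graph binary as it is produced.
# dump_factors is a hypothetical stand-in for whatever emits the binary stream.
dump_factors | bzip2 -c >factors.bz2

# Sampler side: decompress on the fly so the sampler reads the plain binary
# stream from a pipe; no uncompressed copy ever touches the disk.
# run_sampler and its --factors flag are likewise stand-ins.
run_sampler --factors <(bzcat factors.bz2)
```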

This cuts a 2MB (2300279 bytes) factor graph, the one produced by the test with the spouse_example/ddlog app, down to 184kB (184224 bytes), with negligible impact on runtime (it can even be slightly faster):

```bash
groundingTime() {
    # Wall-clock seconds between the grounding dump and the first learning
    # epoch, taken from the timestamps in the given DeepDive log file.
    local log=$1
    local tstart tend
    tstart=$(sed -n '\@process/grounding/.*/dump@{ p; q; }' "$log" | awk '{print $2}')
    tend=$(sed -n '\@LEARNING EPOCH 0@{ p; q; }' "$log" | awk '{print $2}')
    echo "$(date --date="$tend" +%s.%N) - $(date --date="$tstart" +%s.%N)" | bc
}

$ # dump from database and load by sampler without compression
$ groundingTime test/postgresql/spouse_example/ddlog/run/20151221/040549.120506000/log.txt
3.363981000

$ # dump from database and load by sampler with compression
$ groundingTime test/postgresql/spouse_example/ddlog/run/20151221/034709.985065000/log.txt
3.319830000
```

Note that gzip has a 4GB limitation, so bzip2 was used despite its
higher computational cost. xz may be another good candidate to consider.

netj added this to the DeepDive 0.8.1 milestone on Jan 5, 2016
@feiranwang (Contributor) commented:

Maybe we should test it on a larger factor graph (> 1GB) to see how it performs?

@netj (Contributor, Author) commented Jan 7, 2016

Compression certainly has overhead; the question is whether it becomes a bottleneck. I'm trying to ground a larger factor graph by running the spouse example on a larger corpus I synthesized, but that revealed mkmimo's lower throughput, and it's running much slower than expected.

Meanwhile, here are my notes from a quick overhead test with several choices: https://gist.github.com/netj/c6f15bb78ff3a52057cb
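
For anyone wanting to reproduce that kind of measurement, the test amounts to something like this sketch (hypothetical; `graph.bin` stands in for a dumped factor graph binary, and the gist above has the actual numbers):

```bash
# Compare compressed size and compression time for a few candidate codecs.
# Requires GNU time (/usr/bin/time) for the -f format option and GNU stat.
for c in gzip bzip2 pbzip2 xz; do
    command -v "$c" >/dev/null || continue        # skip codecs that aren't installed
    /usr/bin/time -f "$c: %e s" "$c" -c graph.bin >"graph.bin.$c"
    echo "$c: $(stat -c %s "graph.bin.$c") bytes compressed"
done
```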

netj added 3 commits January 14, 2016 00:51
netj force-pushed the compressed-factorgraph-binaries branch from f663669 to f9afd7e on January 14, 2016 08:57
@netj (Contributor, Author) commented Jan 25, 2016

Before I forget, I'll drop some numbers I got a while ago for a large factor graph I synthesized by duplicating the corpus for the spouse example (~12GB uncompressed, 199k vars, 16k weights, 337M factors).

```
LOADED VARIABLES: #199907
         N_QUERY: #139603
         N_EVID : #60304
LOADED WEIGHTS: #16664
LOADED FACTORS: #337742718
```

The following are rough measurements on raiders6 with 111 processes, accounting only for the dumping and loading time.

uncompressed

- 11828322038 bytes (~12GiB)
- 401.224535 secs

pbzip2

- 197572897 bytes (~191MiB; 59.8x smaller)
- 420.276131 secs (+19s; +4.7% increase)

bzip2

- 195875810 bytes (~189MiB; 60.4x smaller)
- 464.805231 secs (+64s; +16% increase)

Since the full grounding took significantly more time (materializing the factors and weights), I'd say the compression overhead is negligible, while its savings on storage footprint, and in turn on I/O, are quite dramatic. The higher-than-usual compression ratio (>>10x) is probably due to the regularity in the factor graph's binary representation. I think we should turn this on by default unless there's a really good counterargument.
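
For completeness, the quoted ratios and overheads follow directly from the raw numbers above (just bc arithmetic, nothing new):

```bash
echo "scale=2; 11828322038 / 197572897" | bc                    # pbzip2: ~59.8x smaller
echo "scale=2; 11828322038 / 195875810" | bc                    # bzip2:  ~60.4x smaller
echo "scale=2; 100*(420.276131-401.224535)/401.224535" | bc     # pbzip2: ~+4.7% time
echo "scale=2; 100*(464.805231-401.224535)/401.224535" | bc     # bzip2:  ~+16% time
```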

@feiranwang (Contributor) commented:

Seems there's a huge saving in space with negligible overhead! Merging.

feiranwang added a commit that referenced this pull request on Jan 26, 2016:

Compresses factor graph binary files with bzip2
feiranwang merged commit 96dab13 into master on Jan 26, 2016
netj deleted the compressed-factorgraph-binaries branch on January 28, 2016 19:27
netj modified the milestones: DeepDive 0.8.1, DeepDive 0.8.0 on Feb 11, 2016