Compresses factor graph binary files with bzip2 #450

Merged

feiranwang merged 3 commits into master from compressed-factorgraph-binaries on Jan 26, 2016

Conversation

@netj (Contributor) commented Jan 5, 2016

When grounding, the factor graph binary files are compressed with bzip2, and they are decompressed on the fly with bzcat when running the sampler.
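
Conceptually the data path looks like the following sketch (hypothetical command names; `dump_factors` and `run_sampler` merely stand in for the actual grounding and sampler invocations, which this PR wires up internally):

```bash
# Grounding side: compress the dumped factor graph binary as it is produced.
# dump_factors is a hypothetical stand-in for whatever emits the binary stream.
dump_factors | bzip2 -c >factors.bz2

# Sampler side: decompress on the fly so the sampler reads the plain binary
# stream from a pipe; no uncompressed copy ever touches the disk.
# run_sampler and its --factors flag are likewise stand-ins.
run_sampler --factors <(bzcat factors.bz2)
```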

This cuts a 2MB (2300279 bytes) factor graph, the one produced by the test with the spouse_example/ddlog app, down to 184kB (184224 bytes), with negligible impact on runtime (it can even be slightly faster):

```bash
groundingTime() {
    # Wall-clock seconds between the grounding dump and the first learning
    # epoch, taken from the timestamps in the given DeepDive log file.
    local log=$1
    local tstart tend
    tstart=$(sed -n '\@process/grounding/.*/dump@{ p; q; }' "$log" | awk '{print $2}')
    tend=$(sed -n '\@LEARNING EPOCH 0@{ p; q; }' "$log" | awk '{print $2}')
    echo "$(date --date="$tend" +%s.%N) - $(date --date="$tstart" +%s.%N)" | bc
}

$ # dump from database and load by sampler without compression
$ groundingTime test/postgresql/spouse_example/ddlog/run/20151221/040549.120506000/log.txt
3.363981000

$ # dump from database and load by sampler with compression
$ groundingTime test/postgresql/spouse_example/ddlog/run/20151221/034709.985065000/log.txt
3.319830000
```

Note that gzip has a 4GB limitation, so bzip2 was used despite its
higher computational cost. xz may be another good candidate to consider.

netj added this to the DeepDive 0.8.1 milestone on Jan 5, 2016
@feiranwang (Contributor) commented:

Maybe we should test it on a larger factor graph (> 1GB) to see how it performs?

@netj (Contributor, Author) commented Jan 7, 2016

Compression certainly has overhead; the question is whether it becomes a bottleneck. I'm trying to ground a larger factor graph by running the spouse example on a larger corpus I synthesized, but that revealed mkmimo's lower throughput, and it's running much slower than expected.

Meanwhile, here are my notes from a quick overhead test with several choices: https://gist.github.com/netj/c6f15bb78ff3a52057cb
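
For anyone wanting to reproduce that kind of measurement, the test amounts to something like this sketch (hypothetical; `graph.bin` stands in for a dumped factor graph binary, and the gist above has the actual numbers):

```bash
# Compare compressed size and compression time for a few candidate codecs.
# Requires GNU time (/usr/bin/time) for the -f format option and GNU stat.
for c in gzip bzip2 pbzip2 xz; do
    command -v "$c" >/dev/null || continue        # skip codecs that aren't installed
    /usr/bin/time -f "$c: %e s" "$c" -c graph.bin >"graph.bin.$c"
    echo "$c: $(stat -c %s "graph.bin.$c") bytes compressed"
done
```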

netj added 3 commits January 14, 2016 00:51
netj force-pushed the compressed-factorgraph-binaries branch from f663669 to f9afd7e on January 14, 2016 08:57
@netj (Contributor, Author) commented Jan 25, 2016

Before I forget, I'll drop some numbers I got a while ago for a large factor graph I synthesized by duplicating the corpus for the spouse example (~12GB uncompressed, 199k vars, 16k weights, 337M factors).

```
LOADED VARIABLES: #199907
         N_QUERY: #139603
         N_EVID : #60304
LOADED WEIGHTS: #16664
LOADED FACTORS: #337742718
```

The following are rough measurements on raiders6 with 111 processes, accounting only for the dumping and loading time.

uncompressed

- 11828322038 bytes (~12GiB)
- 401.224535 secs

pbzip2

- 197572897 bytes (~191MiB; 59.8x smaller)
- 420.276131 secs (+19s; +4.7% increase)

bzip2

- 195875810 bytes (~189MiB; 60.4x smaller)
- 464.805231 secs (+64s; +16% increase)

Since the full grounding took significantly more time (materializing the factors and weights), I'd say the compression overhead is negligible, while its savings on storage footprint, and in turn on I/O, are quite dramatic. The higher-than-usual compression ratio (>>10x) is probably due to the regularity in the factor graph's binary representation. I think we should turn this on by default unless there's a really good counterargument.
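
For completeness, the quoted ratios and overheads follow directly from the raw numbers above (just bc arithmetic, nothing new):

```bash
echo "scale=2; 11828322038 / 197572897" | bc                    # pbzip2: ~59.8x smaller
echo "scale=2; 11828322038 / 195875810" | bc                    # bzip2:  ~60.4x smaller
echo "scale=2; 100*(420.276131-401.224535)/401.224535" | bc     # pbzip2: ~+4.7% time
echo "scale=2; 100*(464.805231-401.224535)/401.224535" | bc     # bzip2:  ~+16% time
```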

@feiranwang (Contributor) commented:

Seems there's a huge saving in space with negligible overhead! Merging.

feiranwang added a commit that referenced this pull request on Jan 26, 2016:

Compresses factor graph binary files with bzip2
feiranwang merged commit 96dab13 into master on Jan 26, 2016
netj deleted the compressed-factorgraph-binaries branch on January 28, 2016 19:27
netj modified the milestones: DeepDive 0.8.1, DeepDive 0.8.0 on Feb 11, 2016