Make TabixReader take a hadoop configuration in the constructor #5033
Conversation
This fixes a bug in import_vcfs as reading the indices and generating partitions is parallelized.
I discovered this when I tried to run a vcf combiner pipeline. To me, this signals that we need better knowledge of where integration tests live and how to add to them.
what is the bug?
Ah.
Can we have a test that catches it?
Hmm, maybe this would always work in local mode since they're in a shared JVM?
That's hard to test.
Exactly, this always works in local mode. We definitely need more ways to test behavior in the non-local environment, but I have no concrete ideas right now.
Why can't we make the HailContext available on the workers?
let's talk about this for a second
You can write non-local tests by submitting a python file to the cluster that is started by the CI.
We should really be running all of our tests on a cluster. We can run Spark in cluster mode on a single machine, that's probably what we should do.
That still leaves open problems caused by true network communication between physically-distant cores. We could package our tests as a test JAR and submit that to the Spark cluster the CI starts.
What kinds of problems are in that category? I'm not saying they don't exist, but shouldn't it be difficult to cause that kind of problem given our base abstractions?
Re: your review @danking We can make the HailContext available on the workers. As far as I can tell, we don't right now because we would need to serialize all the values of HailContext that aren't serializable, broadcast it, and change get to grab the broadcasted value. I could do that. It probably wouldn't take me that long, but this change reverts TabixReader to a behavior that it had during development due to Tim's concern that the hadoop configuration is not serializable. We thought the original version would be okay because TabixReader was only ever constructed on the driver. We were wrong, and considering that we intend to use this to read hundreds of thousands of files at a time, the parallelization is probably a good thing. This change fixes the bug I had in a way consistent with much of our codebase, without making larger changes to how we handle HailContext.
It's not easy to make |
@tpoterba The only things that come to mind are shuffle issues and shared filesystem bugs. Regardless, we already start a cluster, and it's not hard to use that existing functionality to run a non-local test.
is it hard to write a test in hail-ci-build.sh that triggers this?
putting a test in hail-ci-build seems like the wrong thing. We should just run all our tests against a cluster-mode spark, no?
I believe that even a local cluster (2+ JVMs) would be sufficient to reproduce this error. I just have no idea how to configure such a thing.
I also do not know how to run our tests in cluster-mode, but I know how to add a python file to this repo and submit it to the cluster in hail-ci-build.sh ;)
Also reorganize the cluster tests to be under the python directory, and make it easier to add new scripts.
Before they were just selecting nothing.
Alright, let's see how this does.

The diff under discussion, from hail-ci-build.sh:

    time cluster submit ${CLUSTER_NAME} \
        cluster-vep-check.py
    for script in python/cluster-tests/**.py; do
v. nice.
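The new runner amounts to globbing everything under python/cluster-tests/ and submitting each script. A runnable sketch of that shape, with `cluster submit` replaced by `echo` so it can execute without a real cluster; the directory layout and the second script name here are hypothetical:

```shell
#!/bin/sh
set -eu

# hypothetical layout mirroring the reorganized python/cluster-tests directory
mkdir -p python/cluster-tests
touch python/cluster-tests/cluster-vep-check.py
touch python/cluster-tests/cluster-vcf-combiner-check.py

# a real CI run would instead do: time cluster submit ${CLUSTER_NAME} ${script}
for script in python/cluster-tests/*.py; do
    echo "submit ${script}"
done
```

With this shape, adding a new cluster test is just dropping another .py file into the directory, which is the "make it easier to add new scripts" part of the change.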
    import json
    import hail as hl

    gvcfs = ['gs://hail-ci/gvcfs/HG00096.g.vcf.gz',
did you manually upload these?
Yep, did it earlier.
It works! That same script fails if you try to submit it to a cluster running current master.
(…-is#5033)

* Make TabixReader take a hadoop configuration

  This fixes a bug in import_vcfs as reading the indices and generating partitions is parallelized.

* Add cluster test for import_vcfs

  Also reorganize the cluster tests to be under the python directory, and make it easier to add new scripts.

* Make partitions json actually subset the files

  Before they were just selecting nothing.