Make TabixReader take a hadoop configuration in the constructor #5033
Conversation
This fixes a bug in import_vcfs as reading the indices and generating partitions is parallelized.
I discovered this when I tried to run a vcf combiner pipeline. To me, this signals that we need better knowledge of where integration tests live and how to add to them.
what is the bug?
Ah.
Can we have a test that catches it?
Hmm, maybe this would always work in local mode since they're in a shared JVM?
That's hard to test.
Exactly, this always works in local mode. We definitely need more ways to test behavior in the non-local environment, but I have no concrete ideas right now.
Why can't we make the HailContext available on the workers?
let's talk about this for a second
You can write non-local tests by submitting a python file to the cluster that is started by the CI.
We should really be running all of our tests on a cluster. We can run Spark in cluster mode on a single machine, that's probably what we should do.
That still leaves open problems caused by true network communication between physically-distant cores. We could package our tests as a test JAR and submit that to the Spark cluster the CI starts.
What kinds of problems are in that category? I'm not saying they don't exist, but shouldn't it be difficult to cause that kind of problem given our base abstractions?
Re: your review @danking We can make the HailContext available on the workers. As far as I can tell, we don't right now because we would need to serialize all the values of HailContext that aren't serializable, broadcast it, and change get to grab the broadcasted value. I could do that. It probably wouldn't take me that long, but this change reverts TabixReader to a behavior that it had during development due to Tim's concern that the hadoop configuration is not serializable. We thought the original version would be okay because TabixReader was only ever constructed on the driver. We were wrong, and considering that we intend to use this to read hundreds of thousands of files at a time, the parallelization is probably a good thing. This change fixes the bug I had in a way consistent with much of our codebase, without making larger changes to how we handle HailContext.
It's not easy to make |
@tpoterba The only things that come to mind are shuffle issues and shared filesystem bugs. Regardless, we already start a cluster, and it's not hard to use that existing functionality to run a non-local test.
is it hard to write a test in hail-ci-build.sh that triggers this?
putting a test in hail-ci-build seems like the wrong thing. We should just run all our tests against a cluster-mode spark, no?
I believe that even a local cluster (2+ JVMs) would be sufficient to reproduce this error. I just have no idea how to configure such a thing.
I also do not know how to run our tests in cluster-mode, but I know how to add a python file to this repo and submit it to the cluster in hail-ci-build.sh ;)
Also reorganize the cluster tests to be under the python directory, and make it easier to add new scripts.
Before they were just selecting nothing.
Alright, let's see how this does.

The diff under discussion, from hail-ci-build.sh:

    time cluster submit ${CLUSTER_NAME} \
        cluster-vep-check.py
    for script in python/cluster-tests/**.py; do
v. nice.
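The new runner amounts to globbing everything under python/cluster-tests/ and submitting each script. A runnable sketch of that shape, with `cluster submit` replaced by `echo` so it can execute without a real cluster; the directory layout and the second script name here are hypothetical:

```shell
#!/bin/sh
set -eu

# hypothetical layout mirroring the reorganized python/cluster-tests directory
mkdir -p python/cluster-tests
touch python/cluster-tests/cluster-vep-check.py
touch python/cluster-tests/cluster-vcf-combiner-check.py

# a real CI run would instead do: time cluster submit ${CLUSTER_NAME} ${script}
for script in python/cluster-tests/*.py; do
    echo "submit ${script}"
done
```

With this shape, adding a new cluster test is just dropping another .py file into the directory, which is the "make it easier to add new scripts" part of the change.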
    import json
    import hail as hl

    gvcfs = ['gs://hail-ci/gvcfs/HG00096.g.vcf.gz',
did you manually upload these?
Yep, did it earlier.
It works! That same script fails if you try to submit it to a cluster running current master.
(…-is#5033)

* Make TabixReader take a hadoop configuration

  This fixes a bug in import_vcfs as reading the indices and generating partitions is parallelized.

* Add cluster test for import_vcfs

  Also reorganize the cluster tests to be under the python directory, and make it easier to add new scripts.

* Make partitions json actually subset the files

  Before they were just selecting nothing.