Make TabixReader take a hadoop configuration in the constructor #5033

Merged: 3 commits into hail-is:master on Dec 21, 2018

Conversation

@chrisvittal (Collaborator)

This fixes a bug in import_vcfs as reading the indices and generating
partitions is parallelized.

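To make the shape of the fix concrete, here is a minimal sketch; everything other than the TabixReader name is hypothetical and not Hail's actual code. Instead of reaching for driver-side global state, TabixReader now receives the Hadoop configuration from its caller, so it can be constructed inside Spark tasks:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Before (sketch): only safe on the driver, where the global context exists.
    //   class TabixReader(filePath: String) {
    //     private val conf = HailContext.get.hadoopConf  // fails in a worker JVM
    //   }

    // After (sketch): the caller supplies the configuration explicitly.
    class TabixReader(filePath: String, conf: Configuration) {
      // Tabix indices conventionally live next to the data file as <file>.tbi;
      // open it through whatever filesystem `conf` describes (GCS, HDFS, ...).
      private val indexPath = new Path(filePath + ".tbi")
      private val fs: FileSystem = indexPath.getFileSystem(conf)
      def indexExists(): Boolean = fs.exists(indexPath)
    }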
@chrisvittal (Collaborator, Author)

I discovered this when I tried to run a VCF combiner pipeline. To me, this signals that we need better knowledge of where integration tests live and how to add to them.

@tpoterba (Contributor)

What is the bug?

@tpoterba (Contributor)

Ah.

@tpoterba (Contributor)

Can we have a test that catches it?

@tpoterba (Contributor)

Hmm, maybe this would always work in local mode since they're in a shared JVM?

@tpoterba (Contributor)

That's hard to test.

@chrisvittal (Collaborator, Author)

Exactly, this always works in local mode. We definitely need more ways to test behavior in the non-local environment, but I have no concrete ideas right now.
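A sketch of why local mode hides the bug (illustrative only; DriverState is a made-up stand-in for Hail's global context): a JVM-wide singleton set during initialization is visible to tasks only when those tasks run in the same JVM.

    import org.apache.hadoop.conf.Configuration

    // Hypothetical stand-in for driver-side global state like HailContext.
    object DriverState {
      private var conf: Configuration = _
      def set(c: Configuration): Unit = conf = c
      def get: Configuration = {
        assert(conf != null, "not initialized in this JVM")
        conf
      }
    }

    // With master "local[*]", executors share the driver JVM, so a task calling
    // DriverState.get succeeds. On a real cluster, each executor is a fresh JVM
    // in which `conf` was never set, and the assertion fires.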

tpoterba previously approved these changes Dec 21, 2018

@danking (Contributor) left a comment:

Why can't we make the HailContext available on the workers?

@danking dismissed tpoterba’s stale review December 21, 2018 19:34: "let's talk about this for a second"

@danking (Contributor) commented Dec 21, 2018

You can write non-local tests by submitting a Python file to the cluster that is started by the CI.

@tpoterba (Contributor)

We should really be running all of our tests on a cluster. We can run Spark in cluster mode on a single machine; that's probably what we should do.

@danking (Contributor) commented Dec 21, 2018

That still leaves open problems caused by true network communication between physically distant cores. We could package our tests as a test JAR and submit that to the Spark cluster the CI starts.

@tpoterba (Contributor)

What kinds of problems are in that category? I'm not saying they don't exist, but shouldn't it be difficult to cause that kind of problem given our base abstractions?

@chrisvittal (Collaborator, Author)

Re: your review @danking

We can make the HailContext available on the workers. As far as I can tell, we don't right now because we would need to serialize all the values of HailContext that aren't serializable, broadcast the result, and change HailContext.get to grab the broadcast value.

I could do that, and it probably wouldn't take long, but this change reverts TabixReader to a behavior it had during development; we moved away from it because of Tim's concern that the hadoop configuration is not serializable. We thought the original version would be okay because TabixReader was only ever constructed on the driver. We were wrong, and considering that we intend to use this to read hundreds of thousands of files at a time, the parallelization is probably a good thing. This change fixes the bug in a way consistent with much of our codebase, without making larger changes to how we handle HailContext.
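For reference, the usual Spark-ecosystem workaround when a Configuration is needed on workers is a serializable wrapper along these lines (a sketch of the general pattern, not necessarily the code this PR uses); Configuration is a Hadoop Writable but not java.io.Serializable:

    import java.io.{ObjectInputStream, ObjectOutputStream}
    import org.apache.hadoop.conf.Configuration

    class SerializableConfiguration(@transient var value: Configuration)
        extends Serializable {
      // ObjectOutputStream implements DataOutput, so Writable.write applies.
      private def writeObject(out: ObjectOutputStream): Unit = {
        out.defaultWriteObject()
        value.write(out)
      }
      // Rebuild the transient Configuration on the deserializing (worker) side.
      private def readObject(in: ObjectInputStream): Unit = {
        in.defaultReadObject()
        value = new Configuration(false)
        value.readFields(in)
      }
    }

A task that closes over such a wrapper (or a broadcast of one) can then construct a TabixReader worker-side from wrapper.value.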

@chrisvittal dismissed danking’s stale review December 21, 2018 20:28: "See my reply to the discussion"

@tpoterba (Contributor)

It's not easy to make HailContext.get work on the workers.

@danking (Contributor) commented Dec 21, 2018

@tpoterba The only things that come to mind are shuffle issues and shared-filesystem bugs. Regardless, we already start a cluster, and it's not hard to use that existing functionality to run a non-local test.

@danking (Contributor) left a comment:

Is it hard to write a test in hail-ci-build.sh that triggers this?

@tpoterba (Contributor)

Putting a test in hail-ci-build seems like the wrong thing. We should just run all our tests against a cluster-mode Spark, no?

@chrisvittal (Collaborator, Author)

I believe that even a local cluster (2+ JVMs) would be sufficient to reproduce this error. I just have no idea how to configure such a thing.
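For what it's worth, Spark itself has a mode for exactly this: the local-cluster master string launches separate worker JVMs on one machine. A sketch (Spark uses this mode for its own tests, and it requires a full Spark distribution rather than just the library on the classpath):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local-cluster[2, 1, 1024]": 2 worker JVMs, 1 core and 1024 MB each.
    // Unlike "local[*]", executors do NOT share the driver JVM, so bugs that
    // depend on driver-only global state can reproduce here.
    val conf = new SparkConf()
      .setMaster("local-cluster[2, 1, 1024]")
      .setAppName("reproduce-tabix-bug")
    val sc = new SparkContext(conf)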

@danking (Contributor) commented Dec 21, 2018

I also do not know how to run our tests in cluster mode, but I know how to add a Python file to this repo and submit it to the cluster in hail-ci-build.sh ;)

Add cluster test for import_vcfs: Also reorganize the cluster tests to be under the python directory, and make it easier to add new scripts.
Make partitions json actually subset the files: Before they were just selecting nothing.
@chrisvittal (Collaborator, Author)

Alright, let's see how this does.


time cluster submit ${CLUSTER_NAME} \
    cluster-vep-check.py
for script in python/cluster-tests/**.py; do
Contributor:
v. nice.

import json
import hail as hl

gvcfs = ['gs://hail-ci/gvcfs/HG00096.g.vcf.gz',
Contributor:
Did you manually upload these?

Collaborator Author:
Yep, did it earlier.

@chrisvittal (Collaborator, Author)

It works! That same script fails if you try to submit it to a cluster running current master.

@danking merged commit b1473a7 into hail-is:master Dec 21, 2018
bcajes pushed a commit to bcajes/hail that referenced this pull request Jan 4, 2019:

Make TabixReader take a hadoop configuration in the constructor (hail-is#5033)

* Make TabixReader take a hadoop configuration

This fixes a bug in import_vcfs as reading the indices and generating
partitions is parallelized.

* Add cluster test for import_vcfs

Also reorganize the cluster tests to be under the python directory, and
make it easier to add new scripts.

* Make partitions json actually subset the files

Before they were just selecting nothing.