Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added ReferenceMapping for Genotype, filterByOverlappingRegion for GenotypeRDDFunctions #470

Merged
merged 1 commit into from
Nov 14, 2014
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,8 @@ import org.apache.spark.Logging
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.models.{
VariantContext,
SequenceRecord
SequenceRecord,
ReferenceRegion
}
import org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator
import org.bdgenomics.adam.rich.RichVariant
Expand Down Expand Up @@ -101,4 +102,12 @@ class GenotypeRDDFunctions(rdd: RDD[Genotype]) extends Serializable with Logging

bySample
}

def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
def overlapsQuery(rec: Genotype): Boolean =
rec.getVariant.getContig.getContigName.toString == query.referenceName &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing this chain of getters makes me wonder if we should protect against null values? Using Option() and orNull?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the general direction we're moving in (e.g., #469) is to require the API caller to provide input with valid mappings.

From a practical standpoint though, variants are meaningless without being anchored to a reference position. If you get to the point where you have variants without mappings attached to them, you've done something wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, since you use the variant three times, you can define it once using...

val variant = Option(rec.getVariant)

And then use pattern matching or a simple conditional to skip genotypes without a variant

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Genotypes without a variant are ill formed... IMO, we shouldn't check for that case. If we did want to validate that genotypes had a variant attached, it should only be via an assert.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, variants are not nullable in the genotype schema:

record Genotype {
  /**
   The variant called at this site.
   */
  Variant variant;

https://github.com/bigdatagenomics/bdg-formats/pull/39/files changes this, although it is arguable as to whether that is a desirable change (desirable because it makes everything nullable, undesirable because genotypes without variants are meaningless).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you ever want to project a Genotype without the variant? If so, then you need to make it nullable. If not, then we don't need to worry about the null check here as you say.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, then you need to make it nullable. If not, then we don't need to worry about the null check here as you say.

Since it isn't nullable now, no reason to do the check.

Would you ever want to project a Genotype without the variant?

Let's discuss this on the PR that makes the change. Conceivably, you could, but not sure if it makes sense to allow people to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

-Matt

On Sun, Nov 9, 2014 at 9:38 PM, Frank Austin Nothaft <
notifications@github.com> wrote:

In
adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation/VariationRDDFunctions.scala:

@@ -101,4 +102,12 @@ class GenotypeRDDFunctions(rdd: RDD[Genotype]) extends Serializable with Logging

 bySample

}
+

  • def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
  • def overlapsQuery(rec: Genotype): Boolean =
  •  rec.getVariant.getContig.getContigName.toString == query.referenceName &&
    

If so, then you need to make it nullable. If not, then we don't need to
worry about the null check here as you say.

Since it isn't nullable now, no reason to do the check.

Would you ever want to project a Genotype without the variant?

Let's discuss this on the PR
https://github.com/bigdatagenomics/bdg-formats/pull/39/files that makes
the change. Conceivably, you could, but not sure if it makes sense to
allow people to.


Reply to this email directly or view it on GitHub
https://github.com/bigdatagenomics/adam/pull/470/files#r20064839.

rec.getVariant.getStart < query.end &&
rec.getVariant.getEnd > query.start
rdd.filter(overlapsQuery)
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
package org.bdgenomics.adam.rich

import org.bdgenomics.adam.models.{ ReferenceRegion, ReferenceMapping }
import org.bdgenomics.formats.avro.{ AlignmentRecord, Feature, FlatGenotype }
import org.bdgenomics.formats.avro.{ AlignmentRecord, Feature, FlatGenotype, Genotype }

/**
* A common location in which to drop some ReferenceMapping implementations.
Expand All @@ -31,6 +31,12 @@ object ReferenceMappingContext {
ReferenceRegion(value.getReferenceName.toString, value.getPosition, value.getPosition)
}

implicit object GenotypeReferenceMapping extends ReferenceMapping[Genotype] with Serializable {
override def getReferenceName(value: Genotype): String = value.getVariant.getContig.getContigName.toString
override def getReferenceRegion(value: Genotype): ReferenceRegion =
ReferenceRegion(value.getVariant.getContig.getContigName.toString, value.getVariant.getStart, value.getVariant.getEnd)
}

implicit object AlignmentRecordReferenceMapping extends ReferenceMapping[AlignmentRecord] with Serializable {
override def getReferenceName(value: AlignmentRecord): String = value.getContig.getContigName.toString
override def getReferenceRegion(value: AlignmentRecord): ReferenceRegion = ReferenceRegion(value).orNull
Expand Down