Description
It is surprisingly common for VCF files to have missing header fields (see Issue #61 for a 1000 genomes file). The pipeline currently fails as it is unable to map the missing field(s) to the BigQuery column(s). Note that we need to generate the schema prior to running the pipeline, which is why we need to rely on complete and valid headers.
The current workaround is to manually specify the missing header fields through --representative_header_file, but this is painful and not scalable.
An alternative approach is to make two passes over the data: the first parses all of the fields in all records solely to discover the missing headers, and the second actually processes the data. The BigQuery schema would then be generated from the results of the first pass. Of course, this adds computation (roughly 30% more), so we should offer it as an optional feature for "robust" imports. As an optimization, we could add sampling logic for data that is expected to be uniform (e.g. sample 30% of the records and assume everything else follows the same schema).
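To illustrate the idea, here is a minimal sketch of what the first pass could look like. It is not the pipeline's actual implementation: the function name and the plain-text VCF parsing are assumptions for illustration, and a real Beam version would do this as a distributed aggregation rather than a single loop. It scans the record lines for INFO and FORMAT keys that never appear in a ##INFO/##FORMAT header definition; those are the fields we would need to synthesize before building the BigQuery schema.

```python
import re

# Hypothetical helper illustrating the first pass; the real pipeline
# would express this as a Beam transform over the VCF records.
def find_missing_header_fields(vcf_lines):
    """Return (missing_info_keys, missing_format_keys) -- keys that are
    used in records but never declared in the VCF header."""
    declared_info, declared_format = set(), set()
    used_info, used_format = set(), set()
    for line in vcf_lines:
        if line.startswith('##INFO=<ID='):
            declared_info.add(re.match(r'##INFO=<ID=([^,>]+)', line).group(1))
        elif line.startswith('##FORMAT=<ID='):
            declared_format.add(
                re.match(r'##FORMAT=<ID=([^,>]+)', line).group(1))
        elif not line.startswith('#'):
            fields = line.rstrip('\n').split('\t')
            info = fields[7]
            if info != '.':
                # INFO entries are ';'-separated and may be flags (no '=').
                used_info.update(kv.split('=', 1)[0] for kv in info.split(';'))
            if len(fields) > 8:
                # The FORMAT column is a ':'-separated list of keys.
                used_format.update(fields[8].split(':'))
    return used_info - declared_info, used_format - declared_format
```

For a file that declares only DP in its header but whose records use DP, AF, GT, and GQ, this pass would report AF as a missing INFO field and GT/GQ as missing FORMAT fields, which is exactly the information needed to patch the schema before the second pass.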