Skip to content

Add an option to 'detect and correct' VCF header mismatch errors #101

Closed
@arostamianfar

Description

@arostamianfar

It is surprisingly common for VCF files to have missing header fields (see Issue #61 for a 1000 genomes file). The pipeline currently fails as it is unable to map the missing field(s) to the BigQuery column(s). Note that we need to generate the schema prior to running the pipeline, which is why we need to rely on complete and valid headers.

The current workaround is to manually specify the missing header fields through --representative_header_file, but this is painful and not scalable.

An alternative approach is to do two passes on the data: the first one parses all of the fields in all records just to find the missing headers and the 2nd one actually processes the data. We'd use the data from the first pass to generate the BigQuery schema. Of course, this adds additional computation (roughly 30% more), so we should provide this as an optional feature for "robust" imports. As an optimization, we could provide a sampling logic if the data is expected to be uniform (e.g. sample 30% of the data and assume that everything else follows that schema).

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions