Description
Follow up to #2445.
The default sample names in our VCF mapping are always tsk_0, tsk_1, etc., and these may not have anything to do with the IDs of the individuals themselves. We should update the VCF output in order to provide information about the sample IDs corresponding to each VCF sample, and (if relevant) the individual ID. This will make it much easier to map information from the VCF back into the tskit data model.
VCF version 4.3 declares a "Sample field format" (section 1.4.8). Complete definition included here, for easy reference:
It is possible to define sample to genome mappings as shown below:
##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>
##META=<ID=Disease,Type=String,Number=.,Values=[None, Cancer]>
##META=<ID=Ethnicity,Type=String,Number=.,Values=[AFR, CEU, ASN, MEX]>
##META=<ID=Tissue,Type=String,Number=.,Values=[Blood, Breast, Colon, Lung, ?]>
##SAMPLE=<ID=Sample1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germline genome from unaffected",DOI=url>
##SAMPLE=<ID=Sample2,Assay=Exome,Ethnicity=CEU,Disease=Cancer,Tissue=Breast,Description="European patient exome from breast cancer">
So, we could do something like (syntax not quite right for NodeIds I think):
##META=<ID=NodeIds,Type=Number,Number=.>
##META=<ID=IndividualId,Type=Number,Number=1>
##SAMPLE=<ID=tsk_0,NodeIds=0,1,IndividualId=0>
So, if we're in the "no individual data" case, then we don't include an individual ID. It's probably handy to always include the node IDs to make it easier to backtrack this information.
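As a rough illustration of the two cases above, here is a minimal Python sketch that formats one ##SAMPLE line per VCF sample. The helper name sample_header_line is hypothetical, and the key spellings NodeIds and IndividualId are assumed from the example; the comma-separated node list simply mirrors that example, whose syntax is already flagged as not quite right.

```python
from typing import Optional, Sequence


def sample_header_line(
    name: str, node_ids: Sequence[int], individual_id: Optional[int] = None
) -> str:
    """Format one ##SAMPLE line in the layout sketched in this issue.

    Hypothetical helper for illustration only; not part of tskit.
    """
    # Node IDs are always included so the VCF sample can be traced back.
    fields = [f"ID={name}", "NodeIds=" + ",".join(str(u) for u in node_ids)]
    # The individual ID is only emitted when individual data exists.
    if individual_id is not None:
        fields.append(f"IndividualId={individual_id}")
    return "##SAMPLE=<" + ",".join(fields) + ">"


# Diploid individual 0 made up of nodes 0 and 1.
print(sample_header_line("tsk_0", [0, 1], individual_id=0))
# "No individual data" case: one node per VCF sample.
print(sample_header_line("tsk_1", [2]))
```

Because the node list and the key/value pairs both use commas, a real implementation would probably need a different separator or quoting for NodeIds.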
I guess this information will be quite large sometimes, so it's probably worth providing an option to suppress it (sample_header_info=False, I guess?)
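One way the suppression option might behave, as a sketch only: the function build_vcf_header is hypothetical and is not tskit's actual VCF writer, while the parameter name sample_header_info is taken from the suggestion above. It assumes the ##SAMPLE lines have already been formatted elsewhere.

```python
from typing import Iterable, List


def build_vcf_header(
    sample_lines: Iterable[str], sample_header_info: bool = True
) -> List[str]:
    """Assemble header lines, optionally omitting the ##SAMPLE block.

    Hypothetical sketch; only the flag name comes from this issue.
    """
    header = ["##fileformat=VCFv4.3"]
    # Skip the per-sample lines entirely when the caller opts out,
    # which keeps the header small for very large sample sets.
    if sample_header_info:
        header.extend(sample_lines)
    return header


lines = ["##SAMPLE=<ID=tsk_0,NodeIds=0,1,IndividualId=0>"]
print(build_vcf_header(lines))
print(build_vcf_header(lines, sample_header_info=False))
```

Defaulting the flag to True matches the idea that the mapping is included unless the user explicitly suppresses it.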