-
Notifications
You must be signed in to change notification settings - Fork 9
Create_HC_Subset
The Create_HC_Subset handler creates a single variant call format (VCF) file that contains only the high-confidence (HC) sites for your samples. This filtering is performed in multiple steps using several different user-defined parameters and before-and-after percentile tables are generated. The process below can be run as many times as necessary to identify a set of suitable high confidence variants, just re-run this handler. For each step, if there are differences between handling VCFs called using GATK 3.8 vs. GATK 4, those will be noted accordingly. The steps are as follows:
GATK 4: VCF files output from Genotype_GVCFs will likely be split into multiple parts (regions). First, concatenate these split VCFs into a single raw vcf file using bcftools. Since this step only needs to be run once and is time-consuming for large datasets, the handler checks if this file already exists and proceeds to the next step if it does exist. Note: the handler only checks if the filename exists, so if you have a truncated file please delete it before re-running the handler.
GATK 3:
- The genomic part VCF files from Genotype_GVCFs are gzipped, while preserving the original files.
- The gzipped files are concatenated using bcftools into a single VCF file.
- If the data is exome capture, the sites outside the exome capture region are filtered out using vcflib. If not, then nothing happens.
Insertions and deletions (indels) are filtered out using VCFtools. This step also only needs to be run once and is time-consuming for large datasets. The handler will check if this file already exists and proceed to the next step if it exists.
Percentile tables of DP per sample and GQ are generated for the "raw" VCF file. Again, this only needs to be run once and the handler checks if the file exists already.
A custom python3 script is run to filter out low-confidence sites based on the user parameters defined in the configuration file.
The filtering script does the following, based on the user parameters defined in the configuration file:
- Filters out indels and sites with more than two alleles
- If the quality score is missing or the site quality score is too low, filters out the site
- If too many samples are heterozygous, filters out the site
- If too many samples are "bad" (missing, low quality, or low depth), filters out the site
Percentile tables of DP per sample and GQ are generated for the filtered VCF file.
If the organism is barley, the parts positions are converted into pseudomolecular positions using a custom python3 script. If not, then nothing happens.
Use vcftools to remove sites that aren't polymorphic (minor allele count of 0).
If you would like to explore your VCF files to help decide which filtering criteria to use, we have a Jupyter Notebook template that can be modified for this purpose. It is located in the HelperScripts
subdirectory and is called VCF_Exploration_and_Filtering_Notebook.ipynb
. If you decide it is appropriate for your application, you can also use the notebook to filter your VCF in place of Create_HC_Subset.
The VCF_Exploration_and_Filtering_Notebook.ipynb
requires your VCF file be in hdf5 format. In HelperScripts
there is a set of scripts you can use to do this VCF to hdf5 format conversion. If you have a small VCF file, you can directly run the vcf_to_h5.py
script interactively. If you need longer runtimes, the run_vcf_to_h5.sh
script calls on the vcf_to_h5.py
script and can be submitted as a job to run on MSI.
To run Create_HC_Subset, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Create_HC_Subset can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling
):
./sequence_handling Create_HC_Subset Config
Where Config is the full file path to the configuration file.
The following are a list of variables that need to be defined within Config
. In addition to the handler-specific variables, all common variables must be defined.
Variable | Function |
---|---|
CHS_QSUB |
QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00" . |
CHS_VCF_LIST |
A list of full file paths to the chromosome part VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh . |
CAPTURE_REGIONS |
The full file path to the capture regions file in BED format. This should be the same file as the REGIONS_FILE in Coverage_Mapping. If not exome capture, put "NA" . |
CHS_DP_PER_SAMPLE_CUTOFF |
The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5 |
CHS_GQ_CUTOFF |
The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Create_HC_Subset multiple times. |
CHS_MAX_BAD |
The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number) |
CHS_MAX_HET |
The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number) |
CHS_QUAL_CUTOFF |
The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40 |
Create_HC_Subset generates a VCF file containing only the high-confidence variants. The VCF file can be found at
${OUT_DIR}/Create_HC_Subset/${PROJECT}_high_confidence_subset.vcf
Create_HC_Subset depends on VCFtools and vcflib for manipulating the VCF file. Create_HC_Subset also calls on helper scripts that require R, GNU Parallel, and Python3. In addition, PBS is required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.
Next: Variant_Recalibrator
- Getting Started
- Recommended Workflow
- Configuration
- Dependencies
- sample_list_generator.sh
- Slurm specific options
- Common Problems and Errors