Variant calling in NGS experiments

Jorge Jiménez
firstname.lastname@example.org
BIER CIBERER
Genomics Department
Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain)

Index
1. NGS workflow
2. Variant calling
3. Methods for calling
4. SNV and indel calling
5. VCF format
6. Missing values
7. Annotation
8. Databases

Where are we?
NGS pipeline: Sequence preprocessing → Mapping → Variant calling → Downstream analysis

What is variant calling?
Finding a needle in the haystack?

Variant types
- SNV: single nucleotide variant.
- Indel: small insertion/deletion variant.
Example: Reference A; SNV G/G; small indel ATG/A

Genotype and variant calling: concepts
- Phred quality score: a score of 20 corresponds to a 1% error rate in base calling.
- Variant calling: identifying positions where at least one of the bases differs from the reference.
- Genotype calling: the process of determining the genotype at each variant position.
- Importance of base quality recalibration: obtaining well-calibrated quality scores is important, as SNP and genotype calling at a specific position in the genome depends on both the base calls and the per-base quality scores of the reads overlapping the position.
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. Review. PubMed PMID: 21587300.

Methods for calling
- Early methods: counting the number of times each allele is observed.
- Probabilistic methods: they compute genotype likelihoods.
Advantages of probabilistic methods:
- Provide statistical measures of uncertainty.
- Lead to higher accuracy of genotype calling.
- Provide a natural framework for incorporating additional information, such as allele frequencies (AF) and linkage disequilibrium (LD).

Calling algorithms
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. Review. PubMed PMID: 21587300.

Why GATK?
- Probabilistic method: Bayesian estimation of the most likely genotype.
- Calculates many parameters for each position of the genome.
- SNP and indel calling.
- Used in many NGS projects, including the 1000 Genomes Project, The Cancer Genome Atlas, etc.
- Base quality recalibration.
- Uses standard input and output files.
- Many tools for managing VCF files.

Indel calling
- Many software tools are available, such as Dindel, SAMtools, FreeBayes, ...
- Sequence aligners are often unable to perfectly map reads containing insertions or deletions.
- Indel-containing reads can either be left unmapped or be placed in gapless alignments.
- Mismatches in a particular read can interfere with the gap.
- Indel detection becomes difficult with so many missing reads.
- Artifacts introduced by the gapless alignments cause the appearance of false positive SNPs (usually in clusters) → local realignment with GATK

Local realignment
Local realignment of all reads at a specific location simultaneously, to minimize mismatches to the reference genome.
- Reduces erroneous SNPs.
- Refines the location of indels.
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Calling all bases: missing values
We want to know the genotype of all the bases of the exome.
Calling:
- SNVs + all sites of the capture kit
- Indels
Two types of missing values:
- No coverage: ./. (base not sequenced)
- Filtered: -/- (low-quality base)
We do not know the genotype of these bases.

Missing values
? ? ?
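The two kinds of missing value described above can be told apart programmatically. The following is a minimal sketch in Python, assuming genotypes are available as plain strings and that the codes `./.` (no coverage) and `-/-` (filtered) are used exactly as on the slide. The `-/-` code is this pipeline's convention rather than a standard VCF genotype value, and the function names `classify_genotype` and `summarize` are hypothetical, introduced only for illustration.

```python
from collections import Counter

# Missing-value codes as shown on the slide (assumed to be exact strings).
NO_COVERAGE = "./."  # base not sequenced
FILTERED = "-/-"     # low-quality base, removed by filtering


def classify_genotype(gt: str) -> str:
    """Return 'no_coverage', 'filtered', or 'called' for one genotype string."""
    if gt == NO_COVERAGE:
        return "no_coverage"
    if gt == FILTERED:
        return "filtered"
    # Anything else is treated as a called genotype (hypothetical simplification).
    return "called"


def summarize(genotypes):
    """Count how many sites fall into each category."""
    return Counter(classify_genotype(gt) for gt in genotypes)


if __name__ == "__main__":
    # Illustrative genotypes only, not real data.
    example = ["A/A", "G/G", "./.", "-/-", "ATG/A", "./."]
    for category, count in sorted(summarize(example).items()):
        print(category, count)
```

Counting the two categories separately is useful because they call for different follow-up: uncovered sites point to a capture or sequencing gap, while filtered sites were sequenced but did not pass the quality thresholds.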