Find genome variants and indels from RNA-seq (workflow)
- Workflow title
- Find genome variants and indels from RNA-seq
- Provider
- geneXplain GmbH
Workflow overview
Description
The workflow is based on the framework for discovering genotype variation published by DePristo et al., Nature Genetics 43:491-498, 2011. It comprises initial read mapping, local realignment around indels, base quality score recalibration, and SNP discovery and genotyping to find all potential variants.
The first step of the workflow aligns all reads in the fastq file using the TopHat2 tool. The result folder contains a sub-folder “tmp” with all Deletions, Insertions, Splice junctions and Alignments that were found. They are stored as tracks and can be opened in the genome browser by double-clicking each track. Each short line (an arrow at higher zoom) represents an aligned read from the fastq file.
The next step removes duplicates in order to mitigate the effects of PCR amplification bias introduced during library construction. Two read pairs are considered duplicates if they align to the same genomic position. The resulting MarkDuplikates1.log file is stored in the log folder and the MarkDuplikates1.stat file in the stat folder.
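The duplicate criterion described above can be illustrated with a minimal Python sketch. It is not the code the workflow runs. The tuple layout, the function name `find_duplicates` and the rule of keeping the pair with the highest quality score in each group are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicates(read_pairs):
    """Return the names of read pairs that would be flagged as duplicates.

    read_pairs: iterable of (name, chrom, start, mate_start, score) tuples.
    This layout is hypothetical and chosen only for the illustration.
    """
    # Group read pairs by the genomic position of both mates.
    groups = defaultdict(list)
    for name, chrom, start, mate_start, score in read_pairs:
        groups[(chrom, start, mate_start)].append((score, name))

    duplicates = set()
    for members in groups.values():
        # Assumption: keep the best-scoring pair, flag the rest.
        members.sort(reverse=True)
        duplicates.update(name for _score, name in members[1:])
    return duplicates


pairs = [
    ("pair_a", "chr1", 100, 300, 30),
    ("pair_b", "chr1", 100, 300, 35),  # same position as pair_a
    ("pair_c", "chr1", 500, 700, 20),
]
print(find_duplicates(pairs))
```

Here pair_a and pair_b align to the same position, so only the lower-scoring pair_a is reported as a duplicate.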
The next step is local realignment around indels.
Duplicates are then removed again from the realigned BAM file (output MarkDuplicates2.log and MarkDuplicates2.stat). Realignment may change the genomic positions of read pairs, so additional duplicates can be identified after this step.
The next step is recalibration of base quality values. For each base in each read, various covariates (such as reported quality score, position in read, dinucleotide, read GC-content) are calculated. Using these values, the algorithm builds a model that predicts sequencing errors. It then applies this model to calculate an empirical base quality score and overwrites the Phred quality score currently stored in the read. The output is a new BAM file (Good.bam).
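The idea behind recalibration can be sketched in Python as follows. This is a deliberately simplified illustration, not the model used by the workflow: it bins bases by a covariate tuple, counts mismatches per bin, and converts the smoothed error rate into a Phred score with Q = -10 * log10(error rate). The function names, the covariate layout and the "+1/+2" smoothing are assumptions.

```python
import math
from collections import defaultdict


def build_recalibration_table(observations):
    """Map each covariate tuple to an empirical Phred quality.

    observations: iterable of (covariates, is_mismatch) pairs, where
    covariates is a hashable tuple, for example
    (reported_quality, position_in_read, dinucleotide). Hypothetical layout.
    """
    counts = defaultdict(lambda: [0, 0])  # covariates -> [mismatches, total]
    for covariates, is_mismatch in observations:
        entry = counts[covariates]
        entry[0] += int(is_mismatch)
        entry[1] += 1

    table = {}
    for covariates, (mismatches, total) in counts.items():
        # Assumption: simple smoothing so that empty bins never give log(0).
        error_rate = (mismatches + 1) / (total + 2)
        table[covariates] = round(-10 * math.log10(error_rate))
    return table


def recalibrate(reported_quality, covariates, table):
    """Return the empirical quality, or the reported one for unseen bins."""
    return table.get(covariates, reported_quality)


# 98 bases reported as Q30, of which 9 disagree with the reference.
key = (30, 5, "AC")
observed = [(key, i < 9) for i in range(98)]
table = build_recalibration_table(observed)
print(recalibrate(30, key, table))
```

In this example the bases were reported as Q30, but the observed error rate of about 1 in 10 corresponds to an empirical quality of 10, which would replace the reported value.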
Parameters
- Input fastq file
- Minimum read segment length
- We recommend setting this value to about half the read length, because TopHat works better with multiple segments
- OutputFolder
- Folder in which the results are stored
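As an aid for choosing the minimum read segment length, the following Python sketch estimates half the typical read length from a fastq file. It is a hypothetical helper, not part of the workflow, and it assumes plain four-line fastq records (sequence on the second line of each record, no wrapped sequences).

```python
import io
from itertools import islice


def suggest_segment_length(fastq_lines, sample_size=1000):
    """Suggest about half the median read length of the first reads.

    fastq_lines: iterable of text lines, for example an open file handle.
    sample_size: number of reads to inspect (assumed to be representative).
    """
    lengths = []
    for index, line in enumerate(islice(fastq_lines, sample_size * 4)):
        if index % 4 == 1:  # second line of each record holds the sequence
            lengths.append(len(line.strip()))
    if not lengths:
        raise ValueError("no reads found in the input")
    lengths.sort()
    median = lengths[len(lengths) // 2]
    return median // 2


example = io.StringIO(
    "@read1\n" + "A" * 50 + "\n+\n" + "I" * 50 + "\n"
    "@read2\n" + "C" * 50 + "\n+\n" + "I" * 50 + "\n"
)
print(suggest_segment_length(example))
```

For a real file, the handle returned by `open("reads.fastq")` can be passed in place of the in-memory example; the file name is a placeholder.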