Find genome variants and indels from RNA-seq (workflow)

From BioUML platform
Revision as of 16:34, 12 March 2019 by BioUML wiki Bot (Talk | contribs)

Workflow title
Find genome variants and indels from RNA-seq
Provider
geneXplain GmbH

Workflow overview

Find-genome-variants-and-indels-from-RNA-seq-workflow-overview.png

Description

The workflow is based on the framework for discovering genotype variation published by DePristo et al., Nature Genetics 43:491-498, 2011. The process comprises initial read mapping, local realignment around indels, base quality score recalibration, and SNP discovery and genotyping to find all potential variants.

The first step of the workflow aligns all reads from the input fastq file using the TopHat2 tool. The result folder contains a sub-folder “tmp” that holds all deletions, insertions, splice junctions and alignments found. They are stored as tracks, and each track can be opened in the genome browser by double-clicking it. Each short line (an arrow at higher zoom) represents an aligned read from the fastq file.

The next step removes duplicates to mitigate the effects of PCR amplification bias introduced during library construction. Two read pairs are considered duplicates if they align to the same genomic position. The resulting MarkDuplikates1.log file is stored in the log folder and the MarkDuplikates1.stat file in the stat folder.
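The duplicate criterion can be illustrated with a minimal Python sketch. This is not the code the workflow runs: the `ReadPair` fields, the grouping key, and the rule of keeping the pair with the highest mean quality are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    # Hypothetical minimal record for an aligned read pair.
    name: str
    chrom: str
    start: int         # leftmost aligned position of the first mate
    mate_start: int    # leftmost aligned position of the second mate
    strand: str
    mean_quality: float


def mark_duplicates(pairs):
    """Return the names of read pairs flagged as duplicates.

    Pairs that share the same position key are grouped. Within each
    group the pair with the highest mean quality is kept (an assumed
    tie-breaking rule) and all others are flagged.
    """
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.mate_start, pair.strand)
        groups[key].append(pair)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda p: p.mean_quality)
        duplicates.update(p.name for p in group if p is not best)
    return duplicates


pairs = [
    ReadPair("r1", "chr1", 100, 250, "+", 35.0),
    ReadPair("r2", "chr1", 100, 250, "+", 30.0),  # same positions as r1
    ReadPair("r3", "chr1", 100, 260, "+", 32.0),  # mate aligns elsewhere
    ReadPair("r4", "chr2", 100, 250, "+", 31.0),  # different chromosome
]
print(sorted(mark_duplicates(pairs)))
```

In this toy input only r2 shares all alignment coordinates with another pair, so it is the only one flagged.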

The next step is local realignment around indels.

The realigned BAM file is then processed again to remove duplicates (output MarkDuplicates2.log and MarkDuplicates2.stat). Realignment may change the genomic positions of read pairs, so additional duplicates can be identified after this step.

The next step is recalibration of base quality values. For each base in each read, various covariates (such as reported quality score, position in read, dinucleotide, read GC-content) are calculated. Using these values, the algorithm builds a model that predicts sequencing errors. It then applies this model to calculate an empirical base quality score and overwrites the Phred quality score currently stored in the read. The output is a new BAM file (Good.bam).
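The idea behind recalibration can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the model used by the workflow: it assumes that each base has already been labelled as a mismatch or a match, bins bases by a tuple of three covariates, and applies add-one smoothing. The function names and the choice of covariates in the key are hypothetical.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(error_rate)


def build_model(observations):
    """Build an empirical quality table.

    observations: iterable of
        (reported_q, position_in_read, dinucleotide, is_mismatch)
    Returns a dict mapping each covariate bin to an empirical quality.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for reported_q, position, dinuc, is_mismatch in observations:
        key = (reported_q, position, dinuc)
        counts[key][0] += int(is_mismatch)
        counts[key][1] += 1
    # Add-one smoothing (an assumption) keeps the error rate above zero.
    return {key: phred((m + 1) / (n + 2)) for key, (m, n) in counts.items()}


def recalibrate(model, reported_q, position, dinuc):
    """Return the empirical quality for a base, or the reported quality
    unchanged when its covariate bin was never observed."""
    return round(model.get((reported_q, position, dinuc), reported_q))


# 98 bases reported as Q30 at position 5 after "AC", 9 of them mismatches:
# smoothed error rate = (9 + 1) / (98 + 2) = 0.1, i.e. Q10.
observations = [(30, 5, "AC", i < 9) for i in range(98)]
model = build_model(observations)
print(recalibrate(model, 30, 5, "AC"))
```

Here the bases were reported as Q30 (one error in 1000) but the observed mismatch rate is about one in ten, so the sketch lowers their quality to 10.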

 

Parameters

Input fastq file
Minimum read segment length
We recommend setting this value to about half the read length, because TopHat works better with multiple segments
OutputFolder
Results are here
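As a rough aid for choosing the minimum read segment length, the following Python sketch estimates half the median read length of a fastq file. It assumes plain four-line fastq records; the helper names are hypothetical and not part of the workflow.

```python
import io
from statistics import median


def read_lengths(fastq_handle):
    """Yield the sequence length of each four-line fastq record."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            yield len(line.strip())


def suggested_segment_length(fastq_handle):
    """Return about half the median read length."""
    return int(median(read_lengths(fastq_handle))) // 2


# Two toy reads of 50 bases each.
example = io.StringIO(
    "@r1\n" + "A" * 50 + "\n+\n" + "I" * 50 + "\n"
    "@r2\n" + "C" * 50 + "\n+\n" + "I" * 50 + "\n"
)
print(suggested_segment_length(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` example.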