Fastq mcf

From BioUML platform

Latest revision as of 17:35, 25 March 2019

Introduction

fastq-mcf attempts to:

  • Detect & remove sequencing adapters and primers
  • Detect limited skewing at the ends of reads and clip
  • Detect poor quality at the ends of reads and clip
  • Detect Ns, and remove from ends
  • Remove reads with CASAVA 'Y' flag (purity filtering)
  • Discard sequences that are too short after all of the above
  • Keep multiple mate-reads in sync while doing all of the above

Usage

Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]

Detects levels of adapter presence, computes likelihoods and locations (start, end) of the adapters, and removes the adapter sequences from the fastq file(s).

Stats go to stderr unless -o is specified, in which case they go to stdout.

Specify -0 to turn off all default settings.

If you specify multiple 'paired-end' inputs, then a -o option is required for each, e.g.: -o read1.clip.fq -o read2.clip.fq
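The one -o per input rule above can be sketched as a small Python helper that assembles the argument list. This is only an illustration: the function name `build_fastq_mcf_command` and the file names are hypothetical, and the helper builds the command without running it.

```python
def build_fastq_mcf_command(adapters, inputs, outputs, extra_options=()):
    """Return an argv list for fastq-mcf with one -o per input file.

    Hypothetical helper for illustration; it only assembles the arguments
    in the order given by the usage line: options, adapters, then reads.
    """
    if len(inputs) != len(outputs):
        raise ValueError("each input file needs its own -o output file")
    command = ["fastq-mcf", *extra_options]
    for out_path in outputs:
        command += ["-o", out_path]
    command.append(adapters)
    command += list(inputs)
    return command


# Example file names are placeholders.
command = build_fastq_mcf_command(
    "adapters.fa",
    ["read1.fq", "read2.fq"],
    ["read1.clip.fq", "read2.clip.fq"],
)
print(" ".join(command))
```

If fastq-mcf is installed and on the PATH, the resulting list could be passed to `subprocess.run(command, check=True)`.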

Options

   -h       This help
   -o FIL   Output file (stats to stdout)
   -s N.N   Log scale for adapter minimum-length-match (2.2)
   -t N     % occurrence threshold before adapter clipping (0.25)
   -m N     Minimum clip length, overrides scaled auto (1)
   -p N     Maximum adapter difference percentage (10)
   -l N     Minimum remaining sequence length (19)
   -L N     Maximum remaining sequence length (none)
   -D N     Remove duplicate reads: Read_1 has an identical N bases (0)
   -k N     sKew percentage-less-than causing cycle removal (2)
   -x N     'N' (Bad read) percentage causing cycle removal (20)
   -q N     quality threshold causing base removal (10)
   -w N     window-size for quality trimming (1)
   -H       remove >95% homopolymer reads (no)
   -0       Set all default parameters to zero/do nothing
   -U|u     Force disable/enable Illumina PF filtering (auto)
   -P N     Phred-scale (auto)
   -R       Don't remove Ns from the fronts/ends of reads
   -n       Don't clip, just output what would be done
   -C N     Number of reads to use for subsampling (300k)
   -S       Save all discarded reads to '.skip' files
   -d       Output lots of random debugging stuff

Quality adjustment options

   --cycle-adjust    CYC,AMT     Adjust cycle CYC (negative = offset from end) by amount AMT
   --phred-adjust    SCORE,AMT   Adjust score SCORE by amount AMT

Filtering options

   --[mate-]qual-mean  NUM       Minimum mean quality score
   --[mate-]qual-gt    NUM,THR   At least NUM quals > THR
   --[mate-]max-ns     NUM       Maximum N-calls in a read (can be a %)
   --[mate-]min-len    NUM       Minimum remaining length (same as -l)
   --hompolymer-pct    PCT       Homopolymer filter percent (95)

If the mate- prefix is used, the filter applies to the second non-barcode read only.
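To make the meaning of the qual-mean and qual-gt filters concrete, here is a minimal Python sketch of the two checks on a single read. It is not fastq-mcf's own code: the function names are hypothetical, a Phred+33 encoding is assumed (the tool detects the scale itself unless -P is given), and the mean is assumed to be compared with "greater than or equal".

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_qual_mean(scores, minimum_mean):
    """qual-mean style check: mean score at least minimum_mean (assumed >=)."""
    return bool(scores) and sum(scores) / len(scores) >= minimum_mean


def passes_qual_gt(scores, num, thr):
    """qual-gt style check: at least num scores strictly greater than thr."""
    return sum(1 for score in scores if score > thr) >= num


# 'I' decodes to 40 and '#' decodes to 2 under Phred+33.
scores = phred_scores("IIII##")
print(scores)
print(passes_qual_mean(scores, 25), passes_qual_gt(scores, 4, 30))
```

Because quality filters are evaluated after clipping/trimming, a check like this would apply to the trimmed read rather than the raw one.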

Adapter files are 'fasta' formatted:

Specify n/a to turn off adapter clipping, and just use filters

Increasing the scale makes recognition-lengths longer, a scale of 100 will force full-length recognition of adapters.

Adapter sequences with _5p in their label will match 'end's, and sequences with _3p in their label will match 'start's, otherwise the 'end' is auto-determined.
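As an illustration of the labelling convention above, the following Python sketch formats a small adapter file. The labels and sequences are made-up placeholders, and `format_adapter_fasta` is a hypothetical helper; only the _5p and _3p label markers come from the text.

```python
def format_adapter_fasta(adapters):
    """Format (label, sequence) pairs as FASTA text, one record per adapter."""
    lines = []
    for label, sequence in adapters:
        lines.append(f">{label}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


# Placeholder records: labels containing _5p or _3p pin the matching end,
# and a label with neither leaves the end to be auto-determined.
adapters = [
    ("example_adapter_3p", "ACGTACGTACGT"),
    ("example_adapter_5p", "TTGCAATTGCAA"),
    ("example_adapter_auto", "GGCCTTAAGGCC"),
]
fasta_text = format_adapter_fasta(adapters)
print(fasta_text, end="")
```

The resulting text could be written to a file such as adapters.fa and passed as the first positional argument.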

Skew is when one cycle is poor, 'skewed' toward a particular base. If any nucleotide is less than the skew percentage, then the whole cycle is removed. Disable for methyl-seq, etc.

Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done for miRNA, amplicon and other low-complexity situations!)
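The skew rule described above can be sketched in Python as a per-cycle base-composition check. This is an illustrative approximation, not the tool's implementation: `skewed_cycles` is a hypothetical function, it examines every cycle rather than only the read ends, and it works on whatever reads it is given rather than a subsample.

```python
from collections import Counter


def skewed_cycles(reads, skew_pct=2.0):
    """Return 0-based cycles where any of A, C, G, T is below skew_pct percent."""
    flagged = []
    max_len = max((len(read) for read in reads), default=0)
    for cycle in range(max_len):
        # Only reads long enough to reach this cycle contribute a base.
        bases = [read[cycle] for read in reads if len(read) > cycle]
        counts = Counter(bases)
        total = len(bases)
        if any(100.0 * counts.get(base, 0) / total < skew_pct for base in "ACGT"):
            flagged.append(cycle)
    return flagged


# Toy input: the third cycle (index 2) is all G, so A, C and T are at 0%.
reads = ["ACGT", "CAGA", "GTGC", "TGGG"]
print(skewed_cycles(reads))
```

Low-complexity libraries such as miRNA or amplicon data would trip a check like this at many cycles, which is why the text recommends setting -k to 0 for them.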

Duplicate read filtering is appropriate for assembly tasks, and never when read length < expected coverage. -D 50 will use 4.5 GB of RAM on 100 million DNA reads, so be careful. Great for RNA assembly.

Quality filters are evaluated after clipping/trimming

Links

fastq-mcf on GitHub: https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMcf.md
