Difference between revisions of "ChIP-seq Analysis"
Revision as of 16:38, 25 March 2019
Description
Processing raw ChIP-seq data and calculation of quality control metrics for TFBS datasets and AUCs-applications
Use case
FPCM and FNCM
To determine the FPCM (False Positive Control Metric) and the FNCM (False Negative Control Metric), all given datasets of Transcription Factor Binding Regions (TFBRs) must first be merged. The FPCM is defined as the ratio of the observed number of orphans, f, to the expected number of genuine orphans, f_e, i.e.
FPCM = f / f_e,
where orphans are TFBRs that do not overlap with any other TFBR, and the expected number f_e is estimated with the help of the Poisson distribution. If the FPCM is close to 1.0, false positive binding regions are almost completely absent. High FPCM values (for instance, FPCM > 2) indicate that the majority (more than half) of the observed orphans are false positives. Thus, if the FPCM exceeds a given threshold, such as FPCM0 = 2 or 3, it is recommended to modify the datasets by removing orphans.
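The following Python sketch illustrates one way the merging step and the orphan ratio could be computed. It is not the implementation used by the analysis. The function names `merge_regions` and `fpcm` are hypothetical, the regions are assumed to lie on a single chromosome as half-open (start, end) pairs, and the way the expected number of genuine orphans is derived (fitting a Poisson rate to the counts of regions supported by two and three datasets) is an assumption, since the text above only states that the Poisson distribution is used.

```python
from collections import Counter


def merge_regions(datasets):
    """Merge overlapping regions from several datasets (one chromosome).

    datasets: list of lists of (start, end) half-open intervals.
    Returns a list of (start, end, set_of_dataset_indices).
    """
    tagged = sorted(
        (start, end, idx)
        for idx, regions in enumerate(datasets)
        for start, end in regions
    )
    merged = []
    for start, end, idx in tagged:
        if merged and start < merged[-1][1]:
            # Overlaps the current merged region: extend it and record the source.
            prev_start, prev_end, sources = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), sources | {idx})
        else:
            merged.append((start, end, {idx}))
    return merged


def fpcm(merged):
    """Ratio of observed orphans to an assumed Poisson-based expectation."""
    # freq[k] = number of merged regions supported by exactly k datasets.
    freq = Counter(len(sources) for _, _, sources in merged)
    f1, f2, f3 = freq[1], freq[2], freq[3]
    if f2 == 0 or f3 == 0:
        raise ValueError("need regions supported by 2 and by 3 datasets")
    # Assumption: for Poisson counts, f3 / f2 = lam / 3 and f2 / f1 = lam / 2.
    lam = 3.0 * f3 / f2
    f1_expected = 2.0 * f2 / lam
    return f1 / f1_expected


datasets = [
    [(0, 10), (20, 30), (100, 110)],
    [(5, 15), (25, 35)],
    [(8, 12)],
]
merged = merge_regions(datasets)
print(fpcm(merged))
```

In this sketch an orphan is a merged region that is supported by a single dataset. Real data would also need the merge to be done per chromosome.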
To control the false negative rate in the given datasets of TFBRs, an FNCM is defined for each of them. For a given dataset D, the FNCM is the ratio of the observed number of TFBRs in D, N_D, to the estimated number of genuine TFBRs, N_e, i.e.
FNCM = N_D / N_e,
where N_e is estimated with the help of a combination of known estimates such as Chao’s estimate, the Lanumteang-Böhning estimate, Zelterman’s estimate, the maximum likelihood estimate or Chapman’s estimate. The FNCM lies in the range [0.0, 1.0]. The closer the FNCM is to 1.0, the lower the rate of false negatives, while values closer to 0.0 indicate that a high number of genuine TFBRs were overlooked.
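As an illustration of the ratio described above, the sketch below computes three of the named population-size estimates in their textbook forms and divides the observed number of TFBRs in a dataset by one of them. The function names are hypothetical, and because the text does not say how the individual estimates are combined, each is shown on its own rather than as the combination used by the analysis.

```python
import math


def chao_estimate(n_observed, f1, f2):
    """Chao's lower-bound estimate: n + f1^2 / (2 * f2).

    n_observed: number of distinct merged regions,
    f1, f2: regions supported by exactly one / exactly two datasets.
    """
    if f2 == 0:
        raise ValueError("f2 must be positive")
    return n_observed + f1 * f1 / (2.0 * f2)


def zelterman_estimate(n_observed, f1, f2):
    """Zelterman's estimate: n / (1 - exp(-lam)) with lam = 2 * f2 / f1."""
    if f1 == 0 or f2 == 0:
        raise ValueError("f1 and f2 must be positive")
    lam = 2.0 * f2 / f1
    return n_observed / (1.0 - math.exp(-lam))


def chapman_estimate(n1, n2, m):
    """Chapman's two-sample estimate: (n1 + 1)(n2 + 1) / (m + 1) - 1.

    n1, n2: sizes of two datasets, m: number of regions shared by both.
    """
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def fncm(n_in_dataset, n_estimated):
    """Observed number of TFBRs in a dataset over the estimated genuine total."""
    return n_in_dataset / n_estimated


# Hypothetical counts, for illustration only.
n_e = chao_estimate(n_observed=100, f1=40, f2=20)
print(fncm(70, n_e))
```

With these made-up counts Chao's estimate is 140 genuine regions, so a dataset containing 70 of them gets an FNCM of 0.5.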
Analysis Parameters:
- Data Type – Select data type
ChIP-seq raw data processing parameters
- Experiment library layout – Experiment library layout
- FASTQ-file – ChIP-seq raw data
- Has Input Control? – Has Input Control?
- Input control library layout – Input control library layout
- Reference Genome – Reference Genome
- Select BED-files – Select BED-files
FPCM and FNCM Estimation Parameters
- Minimal length of binding region – Binding regions shorter than the minimal length will be extended
- Maximal length of binding region – Binding regions longer than the maximal length will be narrowed
- FPCM threshold – If the FPCM exceeds the threshold, the FNCM will be calculated based on Chapman’s estimate
- Calculate AUC? – Calculate AUC?
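The effect of the two length parameters in the list above can be sketched as follows. The function name `normalize_region` is hypothetical, and resizing symmetrically around the region centre is an assumption, since the parameter descriptions only say that short regions are extended and long ones are narrowed.

```python
def normalize_region(start, end, min_len, max_len):
    """Extend or narrow a half-open region so its length fits [min_len, max_len]."""
    length = end - start
    if min_len <= length <= max_len:
        return start, end
    target = min_len if length < min_len else max_len
    center = (start + end) // 2
    # Assumption: resize around the centre, never going below coordinate 0.
    new_start = max(0, center - target // 2)
    return new_start, new_start + target


print(normalize_region(100, 110, min_len=50, max_len=200))
print(normalize_region(0, 1000, min_len=50, max_len=200))
```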
AUC Calculation Parameters
- Path to folder with site models – Select a folder with transcription factor binding sites (TFBSs) models
- Site model name – Select a TFBS model name
- Sequences collection – Select a source of nucleotide sequences
- Sequences source – Select database to get sequences from or 'Custom' to specify sequences location manually
- Sequence collection – Specify path to folder containing sequences if 'Custom' sequences source is selected
- Path to output folder – Path to output folder