Site prediction
Predicts binding sites of a given transcription factor (TF) in a whole genome, in a given chromosome fragment, or in a ChIP-Seq dataset. Sites are predicted with several Position Weight Matrix (PWM) methods.
Description
The following 6 PWM methods (models) are available:
- Given HOCOMOCO site models (Kulakovskiy et al, 2016). These models are available in the HOCOMOCO database and are located at "databases/HOCOMOCO v11/Data/PWM_HUMAN_mono_pval=0.0001";
- MATCH models (Kel et al, 2003);
- Additive IPS (Individual Probability Score) models, or briefly, IPS models (Volkova et al, 2018);
- Multiplicative IPS models. These models can be reduced to equivalent additive IPS models by taking logarithms of matrix elements;
- Common additive models;
- Common multiplicative models.
To define the common additive and multiplicative models, let the matrix MAT = (mij), i ∈ {A,C,G,T}, j = 1,...,l, denote the given frequency matrix, where l is the site length. For this analysis we used the HOCOMOCO frequency matrices available in the HOCOMOCO database (Kulakovskiy et al, 2016). They are located at "databases/HOCOMOCO v11/Data/PCM_HUMAN_mono/". To test an arbitrary DNA fragment S = (s1,...,sl), the common additive score x is computed in the standard way:
where the values score(j), j=1,…,l, are determined as follows:
The common multiplicative score y is determined by the formula:
If the calculated score (x or y) exceeds the pre-specified threshold, the tested DNA fragment S is declared a predicted site. Note that a common multiplicative model can be converted to an equivalent additive model by taking logarithms of the matrix elements, i.e.
where the values score*(j), j=1,…,l, are determined as follows:
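The scoring and thresholding steps described above can be illustrated with a minimal Python sketch. It is not the tool's implementation. In particular, it assumes that score(j) is simply the column-normalized frequency of nucleotide sj at position j, and it adds a pseudocount so that logarithms are always defined; both choices are assumptions made for illustration, as are all function names and the toy count matrix.

```python
import math

NUCLEOTIDES = "ACGT"

def normalize_columns(counts, pseudocount=1.0):
    """Convert per-position counts into frequencies.

    ASSUMPTION: column-wise normalization with a pseudocount; the tool
    may derive score(j) from the frequency matrix differently.
    """
    freqs = []
    for column in counts:
        total = sum(column[n] + pseudocount for n in NUCLEOTIDES)
        freqs.append({n: (column[n] + pseudocount) / total for n in NUCLEOTIDES})
    return freqs

def additive_score(matrix, fragment):
    # x = sum over j of score(j), with score(j) ASSUMED to be matrix[j][s_j]
    return sum(matrix[j][s] for j, s in enumerate(fragment))

def multiplicative_score(matrix, fragment):
    # y = product over j of score(j), under the same assumption
    return math.prod(matrix[j][s] for j, s in enumerate(fragment))

def log_matrix(matrix):
    # Taking logarithms of the matrix elements turns the multiplicative
    # model into an additive one: log(y) equals the additive score on this matrix.
    return [{n: math.log(column[n]) for n in NUCLEOTIDES} for column in matrix]

def predict_sites(matrix, sequence, threshold, score_fn):
    """Slide a window of the site length along the sequence and keep
    every fragment whose score exceeds the threshold."""
    site_length = len(matrix)
    hits = []
    for start in range(len(sequence) - site_length + 1):
        fragment = sequence[start:start + site_length]
        score = score_fn(matrix, fragment)
        if score > threshold:
            hits.append((start, fragment, score))
    return hits

# Toy count matrix of length 3 (made-up numbers, not taken from HOCOMOCO).
toy_counts = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
]
toy_matrix = normalize_columns(toy_counts)
toy_hits = predict_sites(toy_matrix, "TTACGTT", 0.1, multiplicative_score)
```

Because the logarithm is monotonic, thresholding y at t gives the same predictions as thresholding the additive score on the log matrix at log(t), which is why the two formulations are interchangeable in this sketch.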
References
Kulakovskiy,I.V., Vorontsov,I.E., Yevshin,I.S., Soboleva,A.V., Kasianov,A.S., Ashoor,H., Ba-Alawi,W., Bajic,V.B., Medvedeva,Y.A., Kolpakov,F.A. et al. (2016) HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res., 44, D116–D125.
Kel,A.E., Gößling,E., Reuter,I., Cheremushkin,E., Kel-Margoulis,O.V. and Wingender,E. (2003) MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res., 31, 3576–3579.
Volkova,O.A., Kondrakhin,Y.V., Kashapov,T.A. and Sharipov,R.N. (2018) Comparative analysis of protein-coding and long non-coding transcripts based on RNA sequence features. J. Bioinform. Comput. Biol., 16, 1840013. doi: 10.1142/S0219720018400139.
Analysis Parameters:
- Sequence set type – Select type of sequences;
- Available sequence types: 1) Whole genome 2) Chromosome fragment 3) ChIP-Seq peaks from given track
- Sequences collection – Select a source of nucleotide sequences
- Sequences source – Select database to get sequences from or 'Custom' to specify sequences location manually
- Sequence collection – Specify path to folder containing sequences if 'Custom' sequences source is selected
- If Sequence set type = Chromosome fragment
- Chromosome name – Select chromosome name
- Start position – Type start position of chromosome fragment
- Finish position – Type finish position of chromosome fragment
- If Sequence set type = ChIP-Seq peaks from given track
- Path to track – Select the path to the track with the ChIP-Seq dataset. For example, a track from the GTRD database can be selected, i.e. Path to track = databases/GTRD/Data/peaks/gem/PEAKS033057
- Site name – Type the name of the predicted sites
- Prediction models – Define the prediction models. The user can define several prediction models.
- modelName – Type model name
- siteType – Select site type;
- Available site types: 1) Given site model 2) IPS model 3) Multiplicative IPS model 4) Common additive model 5) Common multiplicative model 6) MATCH model
- If siteType = Given site model
- modelPath – Input the path to the given site prediction model. In particular, the user can select a site model from the HOCOMOCO database, such as "databases/HOCOMOCO v11/Data/PWM_HUMAN_mono_pval=0.0001/CEBPA_HUMAN.H11MO.0.A"
- If siteType ≠ Given site model
- matrixPath – Input the path to the given frequency matrix. In particular, the user can select a frequency matrix from the HOCOMOCO database, such as "databases/HOCOMOCO v11/Data/PCM_HUMAN_mono/CEBPA_HUMAN.H11MO.0.A"
- threshold – Type the score threshold
- The output track name – Type the output track name
- Path to output folder – Path to the output folder
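
The conditional parameters listed above (which fields apply to which sequence set type and site type) can be summarized in a small Python sketch. The dictionary layout and the helper `missing_parameters` are hypothetical and are not the tool's actual input format; only the parameter names and the example paths come from the list above. The site name and the threshold value are placeholders, the Sequences collection parameters are omitted for brevity, and the threshold is not validated.

```python
# Extra fields required for each sequence set type, as listed above.
SEQUENCE_TYPE_FIELDS = {
    "Whole genome": [],
    "Chromosome fragment": ["Chromosome name", "Start position", "Finish position"],
    "ChIP-Seq peaks from given track": ["Path to track"],
}

def missing_parameters(params):
    """Return the names of conditionally required parameters that are absent.

    Hypothetical helper for illustration only.
    """
    missing = []
    sequence_type = params.get("Sequence set type")
    if sequence_type not in SEQUENCE_TYPE_FIELDS:
        missing.append("Sequence set type")
    else:
        missing += [f for f in SEQUENCE_TYPE_FIELDS[sequence_type] if f not in params]
    for index, model in enumerate(params.get("Prediction models", [])):
        # A given site model needs modelPath; every other site type needs matrixPath.
        if model.get("siteType") == "Given site model":
            path_field = "modelPath"
        else:
            path_field = "matrixPath"
        for field in ("modelName", "siteType", path_field):
            if field not in model:
                missing.append(f"Prediction models[{index}].{field}")
    return missing

example_params = {
    "Sequence set type": "ChIP-Seq peaks from given track",
    "Path to track": "databases/GTRD/Data/peaks/gem/PEAKS033057",
    "Site name": "CEBPA",  # placeholder name
    "Prediction models": [
        {
            "modelName": "given_model",
            "siteType": "Given site model",
            "modelPath": "databases/HOCOMOCO v11/Data/PWM_HUMAN_mono_pval=0.0001/CEBPA_HUMAN.H11MO.0.A",
        },
        {
            "modelName": "common_additive",
            "siteType": "Common additive model",
            "matrixPath": "databases/HOCOMOCO v11/Data/PCM_HUMAN_mono/CEBPA_HUMAN.H11MO.0.A",
            "threshold": 0.8,  # placeholder value, not a recommended setting
        },
    ],
}
```

Calling `missing_parameters(example_params)` returns an empty list for this configuration; removing "Path to track" would make that field appear in the result.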