Difference between revisions of "GTRD"

Latest revision as of 14:13, 22 October 2019

Gene Transcription Regulation Database (GTRD) is a database of transcription factor (TF) binding sites identified from ChIP-seq experiments that were systematically collected and uniformly processed using a special workflow (pipeline) for the BioUML platform (http://www.biouml.org). Raw ChIP-seq data and experiment information were collected from:

literature
GEO
SRA
ENCODE.

Raw ChIP-seq and DNase-seq data were uniformly processed using a specially developed workflow (pipeline) for the BioUML platform. The ChIP-seq processing pipeline included the following steps:

sequenced reads were aligned to the corresponding reference genome using Bowtie2;
peaks were identified using MACS2, SISSR, GEM and PICS peak callers;
peaks computed for the same TF and peak calling method but under different experimental conditions (e.g., cell line, treatment) were joined into clusters;
clusters for the same TF revealed by different peak calling methods were joined into metaclusters.
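
The joining steps above can be illustrated with a simple interval merge. This is only a minimal sketch: it assumes that peaks are joined whenever their genomic intervals overlap or touch, which is an illustrative criterion and not necessarily the one used by the GTRD workflow. The function name `merge_peaks` and the `(chrom, start, end)` tuple layout are hypothetical.

```python
from collections import defaultdict


def merge_peaks(peaks):
    """Merge (chrom, start, end) peaks into clusters.

    Assumption: peaks are joined when their intervals overlap or touch.
    The actual GTRD clustering criteria may differ.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    clusters = []
    for chrom in sorted(by_chrom):
        current = None  # (start, end) of the cluster being built
        for start, end in sorted(by_chrom[chrom]):
            if current is None:
                current = (start, end)
            elif start <= current[1]:
                # Overlaps the running cluster: extend its right edge.
                current = (current[0], max(current[1], end))
            else:
                clusters.append((chrom, current[0], current[1]))
                current = (start, end)
        if current is not None:
            clusters.append((chrom, current[0], current[1]))
    return clusters


# Hypothetical peaks for one TF and one peak caller in two cell lines.
peaks_condition_a = [("chr1", 100, 200), ("chr1", 500, 600)]
peaks_condition_b = [("chr1", 150, 260), ("chr2", 10, 50)]
clusters = merge_peaks(peaks_condition_a + peaks_condition_b)
print(clusters)
```

Under the same assumption, applying the function to clusters produced by different peak callers for one TF would give a rough analogue of metaclusters.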

The DNase-seq processing pipeline included the following steps:

sequenced reads were aligned to the corresponding reference genome using Bowtie2;
regions of open chromatin were identified using MACS2 and Hotspot2;
putative de novo protein-DNA interactions were revealed using Wellington, a digital genomic footprinting tool.

Learn more about GTRD build process

The GTRD database is freely available to non-commercial organizations.

Database statistics (18.06 version)

GTRD uses 17485 ChIP-seq experiments corresponding to 2399 UniProt IDs.

ChIP-seq experiments by species

Most ChIP-seq experiments (79.2%) have a corresponding control experiment.

Control experiments

General statistics:

Object type	Total count	Per ChIP-seq experiment
ChIP-seq reads	719.3 × 10⁹	41.1 × 10⁶
Reads aligned	539.8 × 10⁹	30.9 × 10⁶
ChIP-seq peaks	>1.1 × 10⁹	depends on peak caller
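
The per-experiment column can be reproduced from the totals and the number of ChIP-seq experiments stated above. The short check below uses only figures quoted in this section; the variable names are illustrative.

```python
# Figures quoted in this section (GTRD 18.06 statistics).
TOTAL_EXPERIMENTS = 17485
TOTAL_READS = 719.3e9
TOTAL_ALIGNED = 539.8e9

reads_per_experiment = TOTAL_READS / TOTAL_EXPERIMENTS
aligned_per_experiment = TOTAL_ALIGNED / TOTAL_EXPERIMENTS
aligned_fraction = TOTAL_ALIGNED / TOTAL_READS

print(f"Reads per experiment:   {reads_per_experiment / 1e6:.1f} x 10^6")
print(f"Aligned per experiment: {aligned_per_experiment / 1e6:.1f} x 10^6")
print(f"Fraction of reads aligned: {aligned_fraction:.1%}")
```

The last line is a derived figure (the share of reads that were aligned) that the table does not state explicitly.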

On average, each TF has been measured in 7.28 ChIP-seq experiments; 52% of TFs have been measured in more than one experiment.

Transcription factors by species

The ten most studied TFs are listed below:

UniProt ID	Transcription Factor	Species	Number of ChIP-seq experiments
P49711	CTCF	Homo sapiens	432
P03372	ESR1	Homo sapiens	280
P10275	ANDR	Homo sapiens	259
Q61164	CTCF	Mus musculus	255
P17433	SPI1	Mus musculus	156
P55317	FOXA1	Homo sapiens	125
P01106	MYC	Homo sapiens	122
P28033	CEBPB	Mus musculus	100
Q04206	TF65	Homo sapiens	93
P04150	GCR	Homo sapiens	90

GTRD contains 843 DNase-seq experiments.

DNase-seq experiments by species

Database structure

The metadata concerning GTRD is stored in MySQL tables.

Each ChIP-seq and DNase-seq experiment has a row in the 'chip_experiments' or 'dnase_experiments' table, respectively, which assigns an ID and stores basic information about the experiment. The 'chip_experiments' table has the following structure:

Column	Description	Example value
id	Unique experiment identifier	EXP000489
antibody	Antibody used in chromatin immunoprecipitation	sc-345
specie	Species Latin name	Homo sapiens
treatment	Cell treatment or conditions	IFN gamma
control_id	Id of control experiment, NULL for control experiments or experiments without control	EXP000490
cell_id	Studied cell line ID	423
experiment_type		unspecified
tf_uniprot_id	Protein ID in the UniProt database	P42224
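
As a rough illustration of how this table might be queried, the sketch below recreates it in an in-memory SQLite database (a stand-in for MySQL) and looks up experiments for one TF that have a control. The column types are assumptions, the example values from the table are combined into a single row for illustration, and the second row (the control experiment) is hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chip_experiments (
        id              TEXT PRIMARY KEY,  -- column types are assumptions
        antibody        TEXT,
        specie          TEXT,
        treatment       TEXT,
        control_id      TEXT,
        cell_id         INTEGER,
        experiment_type TEXT,
        tf_uniprot_id   TEXT
    )
    """
)

rows = [
    # Example values from the table above, combined into one row.
    ("EXP000489", "sc-345", "Homo sapiens", "IFN gamma",
     "EXP000490", 423, "unspecified", "P42224"),
    # Hypothetical control experiment: control_id is NULL.
    ("EXP000490", None, "Homo sapiens", None,
     None, 423, "unspecified", None),
]
conn.executemany(
    "INSERT INTO chip_experiments VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
)

# Experiments for a given TF that have a control experiment.
with_control = conn.execute(
    "SELECT id, control_id FROM chip_experiments "
    "WHERE tf_uniprot_id = ? AND control_id IS NOT NULL",
    ("P42224",),
).fetchall()
print(with_control)
```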

The 'dnase_experiments' table has the following structure:

Column	Description	Example value
id	Unique experiment identifier	DEXP000044
organism	Species Latin name	Homo sapiens
treatment	Cell treatment or conditions	100nM dexamethasone (CHEBI:41879) for 6 hour
cell_id	Studied cell line ID	38

Links to external databases are stored in the 'external_refs' table:

Column	Description	Example values
id	Experiment identifier	EXP000489
external_db	External database name	GEO or PUBMED or ENCODE or SRA
external_db_id	Identifier in external database	GSM320736

GTRD uses the following object identifiers:

Template	Object type	Example
EXPXXXXXX	ChIP-seq experiment	EXP000489
READSXXXXXX	Collection of ChIP-seq reads	READS000770
ALIGNSXXXXXX	Collection of ChIP-seq reads alignments	ALIGNS010001
PEAKSXXXXXX	Collection of ChIP-seq peaks	PEAKS010000
DEXPXXXXXX	DNase-seq experiment	DEXP000044
DREADSXXXXXX	Collection of DNase-seq reads	DREADS000232
DALIGNSXXXXXX	Collection of DNase-seq reads alignments	DALIGNS000044
DPEAKSXXXXXX	Collection of DNase-seq peaks	DPEAKS000044
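
The identifier templates lend themselves to a simple parser. The sketch below assumes that every identifier is an upper-case prefix followed by exactly six digits, as the templates suggest; the helper name `parse_gtrd_id` is hypothetical. The PEAKS prefix is treated as ChIP-seq peaks, consistent with the hub example in which PEAKS033050 is linked to the ChIP-seq experiment EXP000489.

```python
import re

# Prefix -> object type, following the identifier table above.
ID_TYPES = {
    "EXP": "ChIP-seq experiment",
    "READS": "ChIP-seq reads",
    "ALIGNS": "ChIP-seq read alignments",
    "PEAKS": "ChIP-seq peaks",
    "DEXP": "DNase-seq experiment",
    "DREADS": "DNase-seq reads",
    "DALIGNS": "DNase-seq read alignments",
    "DPEAKS": "DNase-seq peaks",
}

# Assumption: an upper-case prefix followed by exactly six digits.
ID_PATTERN = re.compile(r"^([A-Z]+)(\d{6})$")


def parse_gtrd_id(identifier):
    """Return (object type, numeric part) for a GTRD-style identifier."""
    match = ID_PATTERN.match(identifier)
    if match is None or match.group(1) not in ID_TYPES:
        raise ValueError(f"Not a recognised GTRD identifier: {identifier!r}")
    prefix, number = match.groups()
    return ID_TYPES[prefix], int(number)


print(parse_gtrd_id("EXP000489"))
print(parse_gtrd_id("DREADS000232"))
```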

The relationship between these objects is provided by a 'hub' table:

Column	Description	Example values
input	Input object identifier	READS000770
input_type	Type of input object	ReadsGTRDType
output	Output object identifier	EXP000489
output_type	Type of output object	ExperimentGTRDType

ChIP-seq and DNase-seq reads, alignments and peaks are linked to experiments through the hub table in the following way:

input	input_type	output	output_type	specie
READS000770	ReadsGTRDType	EXP000489	ExperimentGTRDType	Homo sapiens
READS000771	ReadsGTRDType	EXP000489	ExperimentGTRDType	Homo sapiens
READS000772	ReadsGTRDType	EXP000489	ExperimentGTRDType	Homo sapiens
READS000773	ReadsGTRDType	EXP000489	ExperimentGTRDType	Homo sapiens
READS000774	ReadsGTRDType	EXP000489	ExperimentGTRDType	Homo sapiens
READS000775	ReadsGTRDType	EXP000489	ExperimentGTRDType	Homo sapiens
ALIGNS033052	AlignmentsGTRDType	EXP000489	ExperimentGTRDType	Homo sapiens
PEAKS033050	PeaksGTRDType	EXP000489	ExperimentGTRDType	Homo sapiens
DALIGNS000044	DNaseAlignments	DEXP000044	DNaseExperiment	Homo sapiens
DPEAKS000044	DNasePeaks	DEXP000044	DNaseExperiment	Homo sapiens
DREADS000232	DNaseReads	DEXP000044	DNaseExperiment	Homo sapiens
DREADS000233	DNaseReads	DEXP000044	DNaseExperiment	Homo sapiens
DREADS000234	DNaseReads	DEXP000044	DNaseExperiment	Homo sapiens
DREADS000235	DNaseReads	DEXP000044	DNaseExperiment	Homo sapiens
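
Given rows like these, the objects belonging to one experiment can be collected by filtering on the output column. The sketch below works on a small in-memory subset of the rows shown above rather than on the real database; the function name `objects_for_experiment` is hypothetical.

```python
from collections import defaultdict

# A subset of the hub rows shown above: (input, input_type, output, output_type).
HUB_ROWS = [
    ("READS000770", "ReadsGTRDType", "EXP000489", "ExperimentGTRDType"),
    ("READS000771", "ReadsGTRDType", "EXP000489", "ExperimentGTRDType"),
    ("ALIGNS033052", "AlignmentsGTRDType", "EXP000489", "ExperimentGTRDType"),
    ("PEAKS033050", "PeaksGTRDType", "EXP000489", "ExperimentGTRDType"),
    ("DALIGNS000044", "DNaseAlignments", "DEXP000044", "DNaseExperiment"),
    ("DPEAKS000044", "DNasePeaks", "DEXP000044", "DNaseExperiment"),
    ("DREADS000232", "DNaseReads", "DEXP000044", "DNaseExperiment"),
]


def objects_for_experiment(rows, experiment_id):
    """Group the input objects linked to an experiment by their type."""
    grouped = defaultdict(list)
    for input_id, input_type, output_id, _output_type in rows:
        if output_id == experiment_id:
            grouped[input_type].append(input_id)
    return dict(grouped)


chip_objects = objects_for_experiment(HUB_ROWS, "EXP000489")
dnase_objects = objects_for_experiment(HUB_ROWS, "DEXP000044")
print(chip_objects)
print(dnase_objects)
```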

Web interface

The web interface to GTRD is available at http://gtrd.biouml.org/bioumlweb/gtrd.html. It provides capabilities for searching and browsing GTRD.

Start page

The start page contains search boxes and links to browse all experiments in the repository tree, documentation, help (wiki pages) and the transcription factor classification tree.

Search capabilities

Search GTRD

Open ChIP-seq peaks

The ChIP-seq experiments contained in GTRD can be queried from the search box on the Start page. GTRD uses the Lucene engine for indexing and querying ChIP-seq experiments, which provides a rich search syntax. The search can be performed by transcription factor (name or UniProt ID), cell line, antibody or treatment/conditions.

For example, to search for STAT transcription factors, enter 'stat*' in the search field and press 'Enter'. You can also restrict the query to a certain cell line and treatment, e.g. to HeLa cells treated with interferon, using the following query:

 tfTitle:stat* AND cell:hela AND treatment:IFN

Similarly, it is possible to search by publication author, year or title. To search by full article title, quote it:

 "Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing."

To search by author or publication date, use the "articles:" prefix. For example, to search for experiments authored by Snyder in 2007:

 articles:Snyder AND articles:2007
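
Queries of this form can also be assembled programmatically before being pasted into the search box. The helper below is a minimal sketch: the function name `build_query` is hypothetical, only the field names that appear in the examples above (tfTitle, cell, treatment, articles) are used, and Lucene special characters are not escaped.

```python
def build_query(**fields):
    """Join field:term pairs with AND, Lucene style.

    Each keyword argument is a field name; its value is either one term
    or a list of terms. Terms containing spaces are quoted as phrases.
    Special characters are not escaped in this sketch.
    """
    parts = []
    for field, terms in fields.items():
        if isinstance(terms, str):
            terms = [terms]
        for term in terms:
            if " " in term:
                term = f'"{term}"'
            parts.append(f"{field}:{term}")
    return " AND ".join(parts)


stat_query = build_query(tfTitle="stat*", cell="hela", treatment="IFN")
author_query = build_query(articles=["Snyder", "2007"])
print(stat_query)
print(author_query)
```

The two printed strings match the example queries given above.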

The list of matching ChIP-seq experiments will appear in the 'Search result' tab. Select one of them to view detailed ChIP-seq experiment information in the information box. The information box also provides links to the experiment data: reads, alignments and peaks.

To view peaks or alignments, click the link in the ChIP-seq experiment information box; the track will be opened as a table. The track can be exported by pressing the 'Export' button or opened in the genome browser by pressing the 'Open as track' button in the general control panel.

Repository structure

GTRD is organized as a hierarchical Repository.

The GTRD/Data folder contains the following items:

DNase experiments - DNase-seq experiments metainformation
alignments - ChIP-seq read alignments
clusters - ChIP-seq-derived meta-clusters
experiments - ChIP-seq experiments metainformation
peaks - ChIP-seq and DNase-seq peaks identified by peak callers

GTRD/Dictionaries/classification is the TFClass classification tree used by GTRD to reference transcription factors. GTRD/Dictionaries/cells is a collection of cells and tissues used by GTRD.

@@ Line 1: / Line 1: @@
 [[Category:Databases]]
-'''Gene Transcription Regulation Database''' ([http://gtrd.biouml.org GTRD]) is a database of transcription factor binding sites identified from ChIP-seq experiments that were systematically collected and uniformly processed using special workflow (pipeline) for BioUML platform. Raw ChIP-seq data and experiment information were collected from:
+'''Gene Transcription Regulation Database''' ([http://gtrd.biouml.org GTRD]) is a database of transcription factor (TF) binding sites identified from ChIP-seq experiments that were systematically collected and uniformly processed using a special workflow (pipeline) for a BioUML platform (http://www.biouml.org). Raw ChIP-seq data and experiment information were collected from:
 * literature
 * [http://www.ncbi.nlm.nih.gov/geo/ GEO]
@@ Line 6: / Line 6: @@
 * [http://genome.ucsc.edu/ENCODE/ ENCODE].
-Initial ChIP-seq and DNase-seq raw data were uniformly processed using specially developed workflow (pipeline) for BioUML platform.
+Initial ChIP-seq and DNase-seq raw data were uniformly processed using specially developed workflow (pipeline) for the BioUML platform.
-ChIP-seq processing pipeline:
+ChIP-seq processing pipeline included the following steps:
-* sequenced reads were aligned to reference genome using Bowtie2;
+* sequenced reads were aligned to the corresponding reference genome using Bowtie2;
-* peaks were identified using MACS, SISSR, GEM and PICS peak callers;
+* peaks were identified using MACS2, SISSR, GEM and PICS peak callers;
 * peaks computed for the same TF and peak calling method, but different experiment conditions (e.g., cell line, treatment, etc.) were joined into clusters;
 * clusters for the same TF revealed by different peak calling methods were joined into metaclusters.
-DNase-seq processing pipeline:
+DNase-seq processing pipeline included the following steps:
-* sequenced reads were aligned to reference genome using Bowtie2;
+* sequenced reads were aligned to the corresponding reference genome using Bowtie2;
 * regions of open chromatin were identified using MACS2 and Hotspot2;
-* to reveal de novo putative protein-DNA interactions digital genomic footprinting tool Wellington was used.
+* de novo putative protein-DNA interactions were revealed using a digital genomic footprinting tool Wellington.
 [[GTRD_Workflow|Learn more about GTRD build process]]
@@ Line 22: / Line 22: @@
 ==Database statistics (18.06 version)==
-GTRD uses 17485 ChIP-seq experiments correspond to 2399 uniprot IDs. [[File:GTRD statistics-by-species3.png|thumb|ChIP-seq experiments by species]]
+GTRD uses 17485 ChIP-seq experiments corresponding to 2399 UniProt IDs. [[File:GTRD statistics-by-species3.png|thumb|ChIP-seq experiments by species]]
-Most of ChIP-seq experiments (79.2%) have corresponding control experiment. [[File:GTRD statistics-control3.png|thumb|Control experiments]]
+Most of ChIP-seq experiments (79.2%) have corresponding control experiments. [[File:GTRD statistics-control3.png|thumb|Control experiments]]
 General statistics:
@@ Line 38: / Line 38: @@
 |}
-In average each TF has been measured in 7.28 ChIP-seq experiments, 52% of TFs have been measured in more than one experiment. [[File:GTRD tf-statistics-by-species3.png|thumb|Transription factors by species]]
+In average, each TF has been measured in 7.28 ChIP-seq experiments, 52% of TFs have been measured in more than one experiment. [[File:GTRD tf-statistics-by-species3.png|thumb|Transription factors by species]]
-The ten most studied transcription factors are listed bellow:
+Ten most studied TFs are listed bellow:
 {| class="wikitable"
@@ Line 73: / Line 73: @@
 Each ChIP-seq and DNase-seq experiment has a row in 'chip_experiments' or 'dnase-experiments' table, respectively, which assigns id and stores basic information about experiment.
-'chip_experiments' table has following structure:
+'chip_experiments' table has the following structure:
 {| class="wikitable"
@@ Line 96: / Line 96: @@
 |}
-'dnase_experiments' table has following structure:
+'dnase_experiments' table has the following structure:
 {| class="wikitable"
@@ Line 124: / Line 124: @@
 |}
-GTRD uses following object identifiers:
+GTRD uses the following object identifiers:
 {| class="wikitable"
 |-
@@ Line 146: / Line 146: @@
 |}
-The relationship between these objects is provided by 'hub' table:
+The relationship between these objects is provided by a 'hub' table:
 {| class="wikitable"
 |-
@@ Line 194: / Line 194: @@
 ==Web interface==
-{{stub|screenshots}}
 Web interface to GTRD is available [http://gtrd.biouml.org/bioumlweb/gtrd.html here].
@@ Line 200: / Line 199: @@
 ===Start page===
-Start page contains search box and the links to browse all experiments in the repository tree, documentation, help (wiki pages) and transcription factor classification tree.
+Start page contains search boxes and the links to browse all experiments in the repository tree, documentation, help (wiki pages) and transcription factor classification tree.
-[[File:gtrd_startpage3.png|GTRD start page]]
+[[File:gtrd_startpage3.png|GTRD start page|1100px]]
 ===Search capabilities===
@@ Line 210: / Line 209: @@
 The ChIP-seq experiments contained in GTRD can be queried from the search box on the Start page.
 GTRD uses [http://en.wikipedia.org/wiki/Lucene Lucene] engine for indexing and quering ChIP-seq experiments that provides rich [http://lucene.apache.org/core/2_9_4/queryparsersyntax.html syntax] for searching.
-The search can be performed by transcription factor(name or uniprot ID), cell line, antibody or treatment/conditions.
+The search can be performed by transcription factor(name or UniProt ID), cell line, antibody or treatment/conditions.
-For example, to search for STAT transcription factors enter stat* in the search field and press enter.
+For example, to search for STAT transcription factors you should input 'stat*' in the search field and press 'Enter'.
-You can restrict query to HeLa cells treated with interferon with following query:
+You also can restrict the query to a certain cell line and treatment, e.g. to HeLa cells treated with interferon, using the following query:
    tfTitle:stat* AND cell:hela AND treatment:IFN
+Similarly, it is possible to search by publication author, year or title. To search by full article title, quote it:
+  "Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing."
+To search by author or publication date, use "articles:" prefix. For example to search for experiments authored by Snyder in 2007:
+  articles:Snyder AND  articles:2007
 The list of matching ChIP-seq experiments will appear in the 'Search result' tab.
@@ Line 225: / Line 228: @@
 GTRD is organized in hierarchical [[Repository]].
-[[File:gtrd_repository3.png|GTRD repository|100x150px]]
+[[File:gtrd_repository3.png|GTRD repository|235x307px]]
 The GTRD/Data folder contains following items:
@@ Line 233: / Line 236: @@
 *experiments - ChIP-seq experiments metainformation
 *peaks - ChIP-seq and DNase-seq peaks identified by peak callers
-*views - TFClass centric view of GTRD data avalable as {{Type link|tree-table}}s
 GTRD/Dictionaries/classification is [http://www.edgar-wingender.de/huTF_classification.html TFClass] classification tree used by GTRD to reference transcription factors.
 GTRD/Dictionaries/cells is a collection of cells and tissues used by GTRD
