Difference between revisions of "Hadoop"

From BioUML platform

Revision as of 02:01, 17 November 2013

This page or section is a stub. Please add more information here!

The open source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services.

For an overview, see the survey of MapReduce frame operation in bioinformatics [1].
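To make the MapReduce model mentioned above concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API; it only imitates the three phases (map, shuffle, reduce) in a single process, counting nucleotides across a few reads. The function names and the toy reads are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_read(read):
    """Map phase: emit a (base, 1) pair for every base in one read."""
    return [(base, 1) for base in read]


def shuffle(pairs):
    """Shuffle phase: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: sum all counts collected for one key."""
    return key, sum(values)


def count_bases(reads):
    """Run the three phases in-process over an iterable of read strings."""
    mapped = chain.from_iterable(map_read(read) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    toy_reads = ["ACGT", "AACC", "GGTA"]  # hypothetical example data
    print(count_bases(toy_reads))
```

On a real cluster the map and reduce functions would run on different nodes and the framework would perform the shuffle. The sketch only shows how the work is divided.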


List of Hadoop applications for NGS

Hadoop MapReduce-based approaches have become increasingly popular because they scale well when processing large sequencing data sets [2].

Tool, Ref Description URL
SeqPig [2] A library and a collection of tools to manipulate, analyze and query sequencing data sets in a scalable and simple manner.

SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks.

http://sourceforge.net/projects/seqpig/

http://seqpig.sourceforge.net/ (manual)

Hadoop-BAM [3] Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework with the Picard SAM JDK, and command line tools similar to SAMtools.

The file formats currently supported are BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF. The SeqPig project provides a higher-level interface to the file formats supported by Hadoop-BAM.

http://sourceforge.net/projects/hadoop-bam/
BioPig [4] BioPig is based on Apache's Hadoop MapReduce system and the Pig data flow language.

https://sites.google.com/a/lbl.gov/biopig/
DistMap [5] A modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework.

It accepts reads in FASTQ format as input and provides mapped reads in SAM/BAM format. DistMap supports both paired-end and single-end reads, thereby allowing the mapping of read data produced by different sequencing platforms.

Currently, DistMap supports 9 mappers:

http://code.google.com/p/distmap/

http://code.google.com/p/distmap/wiki/Manual

Eoulsan [6] Eoulsan provides an integrated and flexible solution for RNA-Seq data analysis of differential expression.

Amazon EC2, Java

http://transcriptome.ens.fr/eoulsan/
FX [7] FX is an RNA-Seq analysis tool, which runs in parallel on cloud computing infrastructure, for the estimation of gene expression levels and genomic variant calling.

Input:

  • FASTQ-formatted, paired-end raw sequences
  • SAM-formatted alignments (generated by other alignment tools)

Output:

  • List of SNPs and indels in ANNOVAR input format
  • Expression profile
  • Other intermediate files, such as GSNAP alignment output files, total alignment depth (coverage) against the reference genome generated from the 'Base Call' step, and the list of indels before filtering, to give researchers room to do their own analyses.

Amazon EC2, Java

http://fx.gmi.ac.kr
Biodoop Current applications focus on sequence alignment and manipulation of alignment records. Applications generally run on the Pydoop API for Hadoop.

Currently, Biodoop’s core contains a few modules for handling FASTA streams, wrappers for BLAST, I/O modules for some bio formats, a module for converting sequences to the nib format and protobuf serializers for several objects.

http://biodoop.sourceforge.net/core/
Seal Seal is a suite of distributed applications for aligning short DNA reads, and manipulating and analyzing short read alignments.

Currently it includes tools for: read demultiplexing, read alignment, duplicate read removal, sorting read mappings, and calculating statistics for empirical base quality recalibration. Seal is part of the Biodoop suite of tools.

http://biodoop-seal.sourceforge.net/
CloudAligner [8] A fast and full-featured MapReduce based tool for sequence mapping.

One of the first works for Hadoop; the article has a lot of details.

Amazon EC2, Java

http://sourceforge.net/projects/cloudaligner/files/
Crossbow [9] A cloud computing tool for identifying SNPs from high-coverage, short-read resequencing data. Two robust tools, Bowtie and SOAPsnp, implement the fundamental alignment and variant calling operations respectively, and within Crossbow have been shown to analyze approximately one billion short reads per hour on a commodity Hadoop cluster with 320 cores.

Amazon EC2

BlueSNP [10] An R package that distributes GWAS computation over a cluster configured with the Hadoop framework. This makes computationally intensive analyses, such as estimating empirical p-values via data permutation and searching for expression quantitative trait loci over thousands of genes, feasible for large genotype-phenotype datasets.

It uses the RHIPE R package (http://www.datadr.org) for authoring and running MapReduce programs from within the R environment.

http://github.com/ibm-bioinformatics/bluesnp
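As a rough reading of the Crossbow figure quoted in the table above (approximately one billion short reads per hour on 320 cores), the per-core rate can be worked out with simple arithmetic. The sketch below assumes the load is spread evenly across cores, which the source does not state, and the function name is hypothetical.

```python
READS_PER_HOUR = 1_000_000_000  # "approximately one billion short reads per hour"
CORES = 320                     # cluster size quoted for Crossbow
SECONDS_PER_HOUR = 3600


def per_core_throughput(reads_per_hour, cores):
    """Return (reads per core per hour, reads per core per second).

    Assumes an even split of work across cores (an illustrative simplification).
    """
    per_core_hour = reads_per_hour / cores
    per_core_second = per_core_hour / SECONDS_PER_HOUR
    return per_core_hour, per_core_second


if __name__ == "__main__":
    hourly, per_second = per_core_throughput(READS_PER_HOUR, CORES)
    print(f"{hourly:,.0f} reads per core per hour")
    print(f"{per_second:,.0f} reads per core per second")
```

Under that assumption each core handles about 3.1 million reads per hour, or a little under 900 reads per second.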



References

  1. [Zou2013] PMID 23396756
  2. [Schumacher2013] PMID 24149054
  3. [Niemenmaa2012] PMID 22302568
  4. [Nordberg2013] PMID 24021384
  5. [Pandey2013] PMID 24009693
  6. [Jourdren2012] PMID 22492314
  7. [Hong2012] PMID 2225766
  8. [Nguyen2011] PMID 21645377
  9. [Gurtowski2012] PMID 22948728
  10. [Huang2013] PMID 23202745