Difference between revisions of "Matching BioHub"
Tagir Valeev (Talk | contribs) (Notes on SQL connection, subsections - →Implementation of SQL-based matching hub) |
Tagir Valeev (Talk | contribs) m (Sections level changed) |
Revision as of 15:57, 3 July 2013
Matching BioHub is a kind of BioHub that lets you match a list of identifiers from one reference type to another (including cross-species matching if necessary). In BioUML code, a matching BioHub is a Java class that implements the BioHub interface (ru.biosoft.access.biohub.BioHub).
Technical details
A matching graph is defined in which each node is a combination of a reference type and a species, and each edge is a matching procedure implemented by a matching hub. Edges usually connect nodes of the same species, but cross-species hubs are also possible and are used in the Convert table via homology analysis.
A node is defined by a Properties object which has the following keys:
- TYPE_PROPERTY (ReferenceType): stable name of the node's reference type (example: 'EnsemblGeneTableType');
- SPECIES_PROPERTY (Species): Latin name of the node's species (example: 'Homo sapiens').
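As an illustration of the node description in the list above, here is a minimal sketch that builds such a Properties object. It assumes that the names in parentheses (ReferenceType, Species) are the string values of the two keys; the class MatchingNodeSketch and its local constants are hypothetical stand-ins, not part of the BioUML API.

```java
import java.util.Properties;

public class MatchingNodeSketch {
    // Assumption: key values taken from the parenthesised names in the list above.
    // In BioUML the real constants are defined by the BioHub API, not by this class.
    public static final String TYPE_PROPERTY = "ReferenceType";
    public static final String SPECIES_PROPERTY = "Species";

    /** Builds a node description from a reference type stable name and a Latin species name. */
    public static Properties node(String referenceTypeStableName, String speciesLatinName) {
        Properties properties = new Properties();
        properties.setProperty(TYPE_PROPERTY, referenceTypeStableName);
        properties.setProperty(SPECIES_PROPERTY, speciesLatinName);
        return properties;
    }

    public static void main(String[] args) {
        Properties human = node("EnsemblGeneTableType", "Homo sapiens");
        System.out.println(human.getProperty(TYPE_PROPERTY)
                + " / " + human.getProperty(SPECIES_PROPERTY));
    }
}
```

Two such objects, one for the source node and one for the target node, are what a matching request takes as its arguments.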
Each edge is characterized by matching quality, which is a number between 0 and 1 (inclusive). Quality 1 means the best matching quality possible.
Each matching hub can define several matching edges. When you request a matching between given nodes using BioHubRegistry.getMatchingPath(Properties, Properties), it performs a Dijkstra search within the matching graph, looking for the path with the maximal product of matching qualities (since quality 1 is the best, a higher product means a better path).
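The path search can be illustrated with a small self-contained sketch. This is not the BioHubRegistry implementation; it only shows one common way to apply Dijkstra's algorithm to multiplicative qualities. Since quality 1 is the best, the sketch treats the best path as the one with the largest product of edge qualities and obtains it by minimising the sum of -log(quality). The class MatchingPathSketch, the Edge type and the node names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class MatchingPathSketch {
    /** A directed matching edge; quality is expected to lie between 0 and 1. */
    public static final class Edge {
        final String from;
        final String to;
        final double quality;

        public Edge(String from, String to, double quality) {
            this.from = from;
            this.to = to;
            this.quality = quality;
        }
    }

    /** Returns the nodes of the path with the largest quality product, or an empty list. */
    public static List<String> bestPath(List<Edge> edges, String source, String target) {
        Map<String, List<Edge>> outgoing = new HashMap<>();
        for (Edge e : edges) {
            outgoing.computeIfAbsent(e.from, k -> new ArrayList<>()).add(e);
        }
        // cost = sum of -log(quality); minimising it maximises the quality product.
        Map<String, Double> cost = new HashMap<>();
        Map<String, String> previous = new HashMap<>();
        PriorityQueue<Map.Entry<String, Double>> queue =
                new PriorityQueue<>(Map.Entry.<String, Double>comparingByValue());
        cost.put(source, 0.0);
        queue.add(Map.entry(source, 0.0));
        while (!queue.isEmpty()) {
            Map.Entry<String, Double> current = queue.poll();
            String node = current.getKey();
            if (current.getValue() > cost.get(node)) {
                continue; // stale queue entry, a cheaper route was already found
            }
            if (node.equals(target)) {
                break;
            }
            for (Edge e : outgoing.getOrDefault(node, List.of())) {
                // A quality of 0 gives an infinite cost, so such an edge is never taken.
                double candidate = current.getValue() - Math.log(e.quality);
                if (candidate < cost.getOrDefault(e.to, Double.POSITIVE_INFINITY)) {
                    cost.put(e.to, candidate);
                    previous.put(e.to, node);
                    queue.add(Map.entry(e.to, candidate));
                }
            }
        }
        if (!cost.containsKey(target)) {
            return List.of();
        }
        List<String> path = new ArrayList<>();
        for (String n = target; n != null; n = previous.get(n)) {
            path.add(n);
        }
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(
                new Edge("A", "C", 0.5),   // direct but low-quality matching
                new Edge("A", "B", 0.9),
                new Edge("B", "C", 0.9));  // two steps, product 0.81
        System.out.println(bestPath(edges, "A", "C"));
    }
}
```

In this toy graph the two-step route through B is preferred over the direct edge, because 0.9 × 0.9 = 0.81 exceeds 0.5.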
Debugging matching graph
You can debug the matching graph using the biohub JavaScript host object in the script viewpart. Use the getReachableTypes method to retrieve all reference types reachable from a given node. Use the getMatchingPlan method to retrieve the list of optimal matching steps between given nodes. The matchDebug method provides verbose output for the matching procedure of a given identifier between given nodes.
Implementation of SQL-based matching hub
The easiest way to implement your own matching hub is to prepare a special MySQL database and subclass SQLBasedHub.
Database
The database schema is the following:
CREATE TABLE `hub` (
  `input` varchar(20) NOT NULL,
  `input_type` int NOT NULL,
  `output` varchar(20) NOT NULL,
  `output_type` int NOT NULL,
  `specie` int NOT NULL,
  KEY `input` (`input`,`specie`,`output_type`),
  KEY `output` (`output`,`specie`,`input_type`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `hub_terms` (
  `id` int primary key,
  `term` varchar(100) not null,
  key `term` (`term`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
So two tables called hub and hub_terms must be present. Of course, you can use the same database for other purposes as well. The input_type, output_type and specie fields refer to the terms in the hub_terms table via the hub_terms.id field. The input_type and output_type fields must refer to the reference type stable name (usually the simple name of the reference type class; see ReferenceTypeSupport.getStableName()). The specie field must refer to the Latin name of the species. The input and output fields contain the identifier of the input type and the converted identifier of the output type, respectively.
Class
After creating the database you must subclass the SQLBasedHub class, providing an implementation of the SQLBasedHub.getMatchings() method. This method must return an array of Matching objects, which are constructed with the following parameters:
- inputType: ReferenceType class for the input type.
- outputType: ReferenceType class for the output type.
- forward: if false, the matching is performed in the backward direction, so the input id is looked up in the hub.output field and the output id is returned from the hub.input field.
- quality: the quality of given matching (between 0 and 1).
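To make the effect of the forward flag concrete, the following sketch builds a parameterised lookup query against the hub and hub_terms tables described in the Database section. The actual query issued by SQLBasedHub is not shown here, so this is only an assumption about what such a lookup could look like; the class HubLookupSketch and the method lookupSql are hypothetical.

```java
public class HubLookupSketch {
    /**
     * Builds a lookup query with four placeholders, in this order: the known identifier,
     * the stable name of its reference type, the stable name of the wanted reference type,
     * and the Latin name of the species. With forward == false the roles of the
     * input and output columns are swapped.
     */
    public static String lookupSql(boolean forward) {
        String known = forward ? "input" : "output";   // column searched for the given id
        String wanted = forward ? "output" : "input";  // column returned as the result
        return "SELECT h." + wanted + " FROM hub h"
                + " JOIN hub_terms known_type ON known_type.id = h." + known + "_type"
                + " JOIN hub_terms wanted_type ON wanted_type.id = h." + wanted + "_type"
                + " JOIN hub_terms species ON species.id = h.specie"
                + " WHERE h." + known + " = ?"
                + " AND known_type.term = ?"
                + " AND wanted_type.term = ?"
                + " AND species.term = ?";
    }

    public static void main(String[] args) {
        System.out.println(lookupSql(true));
        System.out.println(lookupSql(false));
    }
}
```

The two indexes in the schema (one starting with input, one starting with output) suit both directions of this lookup, which is why a single table can serve forward and backward matchings.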
Connection
The SQLBasedHub.getConnection() method must return a Connection to the MySQL database. The default implementation works as follows:
- If the BioHub is registered within an SQL module, then the module's default SQL connection is used.
- If the BioHub has the properties jdbcURL, jdbcUser and jdbcPassword, then these properties are used to create a connection. See the biohub extension point for details on how to specify BioHub properties.
- Otherwise the hub will be disabled.
If this algorithm doesn't satisfy you, you may override this method to create the Connection in a custom way.
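The fallback order described above can be sketched as follows. This is not the SQLBasedHub code: the class HubConnectionSketch, the Source enum and both methods are hypothetical, and the sketch assumes that the hub properties are available as a java.util.Properties object. Only the property names jdbcURL, jdbcUser and jdbcPassword come from the description above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

public class HubConnectionSketch {
    /** Where the connection would come from, following the three rules above. */
    public enum Source { MODULE_DEFAULT, JDBC_PROPERTIES, DISABLED }

    /** Decides which rule applies, without touching any database. */
    public static Source chooseSource(boolean registeredInSqlModule, Properties hubProperties) {
        if (registeredInSqlModule) {
            return Source.MODULE_DEFAULT;
        }
        boolean hasJdbcSettings = hubProperties.getProperty("jdbcURL") != null
                && hubProperties.getProperty("jdbcUser") != null
                && hubProperties.getProperty("jdbcPassword") != null;
        return hasJdbcSettings ? Source.JDBC_PROPERTIES : Source.DISABLED;
    }

    /** Opens a connection from the three JDBC properties using plain JDBC. */
    public static Connection openFromProperties(Properties hubProperties) throws SQLException {
        return DriverManager.getConnection(
                hubProperties.getProperty("jdbcURL"),
                hubProperties.getProperty("jdbcUser"),
                hubProperties.getProperty("jdbcPassword"));
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("jdbcURL", "jdbc:mysql://localhost/hubdb"); // placeholder URL
        properties.setProperty("jdbcUser", "user");                        // placeholder
        properties.setProperty("jdbcPassword", "secret");                  // placeholder
        System.out.println(chooseSource(false, properties));
    }
}
```

An overriding getConnection() could reuse a method like openFromProperties with settings read from elsewhere, for example a connection pool or a configuration file.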