1. How the database is constructed
The searchable database is built in four steps: 1) download all genome sequences and annotation
files; 2) run a within-species BLAST for each query species to screen for single-copy CDS; 3) BLAST
the single-copy query sequences against all potential reference genomes; 4) read the user-provided
parameters from the web interface, parse the BLAST results and send the results to the output
webpages.
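As an illustration of step 2, the single-copy screen can be sketched as follows. This is a minimal sketch, not the EvolMarkers code. It assumes tab-delimited BLAST output with the query ID in the first column and the subject ID in the second, and it treats any non-self hit as evidence of a second copy. The actual pipeline may apply identity or coverage thresholds that are not described here. The function name `single_copy_queries` and the example IDs are hypothetical.

```python
from collections import defaultdict


def single_copy_queries(blast_lines, all_ids):
    """Return the IDs in all_ids that have no non-self hit.

    blast_lines: lines of tab-delimited BLAST output (assumed layout:
                 query ID in column 1, subject ID in column 2).
    all_ids:     every CDS ID that was used as a query.
    """
    non_self_hits = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        if query_id != subject_id:
            non_self_hits[query_id].add(subject_id)
    # A CDS is kept only if it hit nothing except itself.
    return [seq_id for seq_id in all_ids if seq_id not in non_self_hits]


if __name__ == "__main__":
    example = [
        "cds1\tcds1\t100.0\t600",
        "cds2\tcds2\t100.0\t500",
        "cds2\tcds3\t85.0\t450",
        "cds3\tcds3\t100.0\t480",
        "cds3\tcds2\t85.0\t450",
    ]
    print(single_copy_queries(example, ["cds1", "cds2", "cds3"]))
```

In this toy input only `cds1` is retained, because `cds2` and `cds3` hit each other.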
The fasta files of genome sequences were downloaded from Ensembl (http://www.ensembl.org/), the DOE
Joint Genome Institute (http://www.jgi.doe.gov/), the Beijing Genomics Institute
(http://www.genomics.cn/), the Human Genome Sequencing Center (http://www.hgsc.bcm.tmc.edu/) and
others. Gene annotation information was extracted from EMBL or GFF files retrieved from Ensembl and
the other sources. If you cannot find your favorite species in the list, please fill in the
Suggestion Form, indicating the species name and the available genome resources. The species will be
added to the next version of the EvolMarkers database.
2. How to search the database
The "Searching Markers" page provides several options. First, you choose which type of marker you
want to find: CDS (coding sequence) or EPIC (exon-primed intron-crossing). If you select CDS, you
can change the minimum length of the CDS. If you select EPIC, you can change the maximum intron
size. You will also be asked how many fasta files you want to save, because a search may return
thousands of markers and the web server has limited space to store fasta files for all of them. The
"minimum identity in the coding part of EPIC markers" option serves a similar purpose: if you select
a high identity value, only EPIC markers with highly conserved flanking exon regions are reported.
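The effect of the EPIC options above can be pictured with a short filter. This is only a sketch under stated assumptions: the class `EpicCandidate`, its field names and the function `select_epic_markers` are hypothetical, and keeping the first N markers for fasta output is an assumed policy, since the page does not say which markers are saved when the limit is reached.

```python
from dataclasses import dataclass


@dataclass
class EpicCandidate:
    marker_id: str        # hypothetical: gene ID plus a sequential number
    intron_size: int      # intron length in bp
    exon_identity: float  # average identity of the flanking exons, in percent


def select_epic_markers(candidates, max_intron_size, min_identity, max_fasta_files):
    """Apply the user options and return (listed markers, markers saved as fasta)."""
    kept = [
        c for c in candidates
        if c.intron_size <= max_intron_size and c.exon_identity >= min_identity
    ]
    # Assumption: when more markers pass than can be stored, keep the first ones.
    to_save = kept[:max_fasta_files]
    return kept, to_save


if __name__ == "__main__":
    candidates = [
        EpicCandidate("geneA_1", 300, 95.0),
        EpicCandidate("geneA_2", 5000, 99.0),
        EpicCandidate("geneB_1", 800, 70.0),
        EpicCandidate("geneC_1", 900, 92.0),
    ]
    listed, saved = select_epic_markers(candidates, 1000, 90.0, 1)
    print([c.marker_id for c in listed], [c.marker_id for c in saved])
```

Raising `min_identity` or lowering `max_intron_size` shortens the list, which is why these options help keep the number of saved fasta files manageable.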
Second, you have to choose one query species and one or more subject genomes. Usually, the species
most closely related to your focal species should be selected as the query. If that species is not
available in the list of queries, it should at least be selected as a subject. For example, we were
interested in developing phylogenetic markers for sharks, but no well-annotated shark genome was
available, so we used zebrafish as the query and the 1.1X genome sequence of the elephant shark
Callorhinchus milii as the subject. We found about a hundred candidate markers (>500 bp). Ten of the
17 candidates tested amplified in 14 species across the chondrichthyan lineages and provided a good
amount of signal (Li C., K. Matthes and G. Naylor, unpublished data).
Note also that candidate EPIC markers often need to be tested empirically in the focal species,
because intron length is highly variable between distantly related species (Li et al., 2010).
3. The output files
There are two types of files on the output pages. The first is the list of markers. For EPIC
markers, the list gives the EPIC marker ID (the gene ID plus a sequential number), the position of
the flanking exons in each genome, the intron size, the average identity of the flanking exons and
the gene information. For CDS markers, the "onehitCDSmarkers" file includes the gene description,
the position of the marker in each genome, and the average identity, coverage and length of each
marker. Only query sequences with exactly one hit in every subject genome were selected as CDS
markers.
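The "exactly one hit in every subject genome" rule can be expressed compactly. The sketch below is illustrative only: it assumes the BLAST results have already been reduced to (query ID, subject genome) pairs, one pair per hit, and the function name `one_hit_markers` and the example genome labels are hypothetical.

```python
from collections import Counter


def one_hit_markers(hits, subject_genomes):
    """Return the query IDs that have exactly one hit in every subject genome.

    hits: iterable of (query_id, subject_genome) pairs, one pair per BLAST hit.
    """
    counts = Counter(hits)
    queries = {query_id for query_id, _ in counts}
    return sorted(
        q for q in queries
        if all(counts[(q, genome)] == 1 for genome in subject_genomes)
    )


if __name__ == "__main__":
    hits = [
        ("q1", "genomeA"), ("q1", "genomeB"),
        ("q2", "genomeA"), ("q2", "genomeA"), ("q2", "genomeB"),
        ("q3", "genomeA"),
    ]
    print(one_hit_markers(hits, ["genomeA", "genomeB"]))
```

Here `q2` is rejected because it has two hits in one genome, and `q3` is rejected because it has no hit in the other.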
The second type of file contains the fasta sequences of the query and subject taxa for each marker.
EPIC markers are named using the gene ID plus a sequential number, since one gene may have more than
one qualifying EPIC marker. The position of the intron is indicated by "XXXXX". The length of the
intron is the number appended to the end of each sequence. CDS markers are named using the position
on the genome plus the Ensembl gene ID. All fasta files are zipped into one folder, which can be
downloaded and unzipped. The sequences can then be aligned and used for designing "universal"
primers.
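Before aligning EPIC sequences, it can help to separate the two flanking exons and read off the intron length. The sketch below assumes that each sequence is a string of residues containing the "XXXXX" placeholder and ending in a run of digits that gives the intron length. The exact file layout is not specified here, so check a downloaded file before relying on this. The function name `split_epic_sequence` is hypothetical.

```python
import re


def split_epic_sequence(sequence):
    """Split an EPIC marker sequence into (upstream exon, downstream exon, intron length).

    Assumed layout: <exon>XXXXX<exon><digits>, possibly wrapped over several lines.
    The intron length is None if no trailing number is present.
    """
    seq = "".join(sequence.split())       # drop line breaks and spaces
    match = re.search(r"(\d+)$", seq)     # trailing digits, if any
    intron_length = int(match.group(1)) if match else None
    residues = seq[:match.start()] if match else seq
    upstream, placeholder, downstream = residues.partition("XXXXX")
    if not placeholder:
        raise ValueError("no 'XXXXX' intron placeholder found")
    return upstream, downstream, intron_length


if __name__ == "__main__":
    print(split_epic_sequence("ATGGCC\nXXXXXGGTTAA\n412"))
```

The two exon strings can then be aligned across taxa separately, which keeps the placeholder out of the alignment.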