1. How the database is constructed
The searchable database is constructed in four steps: 1) download all genome sequences and annotation files; 2) run a within-species BLAST for the query species to screen for single-copy CDS; 3) BLAST the single-copy query sequences against all potential reference genomes; 4) read the user-provided parameters from the web interface, parse the BLAST results and send the results to the output webpages.
The fasta files of genome sequences were downloaded from Ensembl (http://www.ensembl.org/), the DOE Joint Genome Institute (http://www.jgi.doe.gov/), the Beijing Genomics Institute (http://www.genomics.cn/), the Human Genome Sequencing Center (http://www.hgsc.bcm.tmc.edu/) and other sources. The gene annotation information was extracted from EMBL or GFF files retrieved from Ensembl and the other sources. If you cannot find your favorite species in the list, please fill in the Suggestion Form, indicating the species name and the available genome resources. The species will be added to the next version of the EvolMarkers database.
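The single-copy screen in step 2 can be illustrated with a minimal sketch. It assumes the within-species BLAST compared each CDS with the CDS set of the same species and was written in the standard 12-column tabular format (query ID, subject ID, percent identity, ...). The function names and the 50% identity cutoff are hypothetical and are not the values used by EvolMarkers.

```python
def parse_tabular(lines):
    """Yield (query_id, subject_id, percent_identity) from BLAST tabular lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        yield fields[0], fields[1], float(fields[2])


def single_copy_ids(lines, paralog_identity=50.0):
    """Return query IDs that have no non-self hit at or above the cutoff.

    paralog_identity is a hypothetical threshold chosen for illustration.
    """
    seen = set()
    with_paralog = set()
    for query_id, subject_id, identity in parse_tabular(lines):
        seen.add(query_id)
        if subject_id != query_id and identity >= paralog_identity:
            with_paralog.add(query_id)
    return sorted(seen - with_paralog)


# Made-up hits: cdsB and cdsC hit each other, cdsA only hits itself.
example = [
    "cdsA\tcdsA\t100.0\t600\t0\t0\t1\t600\t1\t600\t0.0\t1100",
    "cdsB\tcdsB\t100.0\t550\t0\t0\t1\t550\t1\t550\t0.0\t1000",
    "cdsB\tcdsC\t82.5\t540\t94\t0\t1\t540\t1\t540\t1e-150\t520",
    "cdsC\tcdsC\t100.0\t540\t0\t0\t1\t540\t1\t540\t0.0\t990",
    "cdsC\tcdsB\t82.5\t540\t94\t0\t1\t540\t1\t540\t1e-150\t520",
]
print(single_copy_ids(example))
```

With the made-up hits above, only cdsA is kept, because cdsB and cdsC each have a second, non-self hit.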
2. How to search the database
The "Searching Markers" page provides users with several options. First, you can choose which type of markers (CDS or EPIC) you want to find. If CDS is selected, the minimum length of the CDS can be changed. If EPIC is selected, the maximum intron size can be modified. You will also be asked how many fasta files you want to save, because a search may sometimes return thousands of markers, and the web server has limited space and cannot save fasta files for all of them. The "minimum identity in the coding part of EPIC markers" option serves a similar purpose: if you select a high identity value, only EPIC markers with highly conserved flanking exon regions are printed.
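The way these options interact for an EPIC search can be sketched as follows. This is an illustration only: the record type, field names and function name are hypothetical, and the sketch assumes that fasta files are saved for the first markers in list order, which the text above does not specify.

```python
from dataclasses import dataclass


@dataclass
class EpicCandidate:
    # Hypothetical record; the field names are not taken from EvolMarkers.
    marker_id: str
    intron_size: int      # intron length in bp
    exon_identity: float  # average identity of the flanking exons, in percent


def select_epic(candidates, max_intron_size, min_exon_identity, max_fasta_files):
    """Return (markers to list, markers for which fasta files are saved)."""
    listed = [
        c for c in candidates
        if c.intron_size <= max_intron_size and c.exon_identity >= min_exon_identity
    ]
    # Assumption: fasta files are saved for the first N listed markers.
    return listed, listed[:max_fasta_files]


candidates = [
    EpicCandidate("gene1_1", 400, 92.0),
    EpicCandidate("gene1_2", 2500, 95.0),  # intron too long
    EpicCandidate("gene2_1", 300, 70.0),   # flanking exons too divergent
    EpicCandidate("gene3_1", 800, 88.0),
]
listed, saved = select_epic(candidates, max_intron_size=1000,
                            min_exon_identity=80.0, max_fasta_files=1)
print([c.marker_id for c in listed], [c.marker_id for c in saved])
```

Raising the minimum identity or lowering the maximum intron size shortens the marker list, while the fasta limit only caps how many sequence files are written.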
Second, you must choose one query species and one or more subject genomes. Usually, the species most closely related to the focal species should be selected as the query. If it is not available in the list of queries, it should at least be selected as a subject. For example, we were interested in developing phylogenetic markers for sharks, but no well-annotated shark genomes were available, so we used zebrafish as the query and the 1.1X-coverage genome sequence of the elephant shark Callorhinchus milii as the subject. We found about a hundred candidate markers (>500 bp). Ten of the 17 candidates tested amplified successfully in 14 species across the chondrichthyan lineages and provided a good amount of signal (Li C., K. Matthes and G. Naylor, unpublished data).
Note also that candidate EPIC markers often need to be tested empirically in the focal species, because intron length is highly variable between distantly related species (Li et al., 2010).
3. The output files
There are two types of files on the output pages. The first is the list of markers. For EPIC markers, the list gives the EPIC marker ID (the gene ID plus a sequential number), the position of the flanking exons in each genome, the intron size, the average identity of the flanking exons and the gene information. For CDS markers, the "onehitCDSmarkers" file includes the gene description, the position of the marker in each genome, and the average identity, coverage and length of each marker. Only query sequences with exactly one hit in every subject genome are selected as CDS markers.
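The "exactly one hit in every subject genome" rule for CDS markers can be sketched as below. The input is assumed to be a list of (query ID, subject genome) pairs, one pair per BLAST hit that passed the search parameters. The function name and the genome labels are hypothetical.

```python
from collections import Counter


def one_hit_queries(hits, subject_genomes):
    """Return query IDs that have exactly one hit in each subject genome.

    hits: iterable of (query_id, genome_name) pairs, one pair per hit.
    """
    counts = Counter(hits)
    queries = {query_id for query_id, _ in counts}
    return sorted(
        q for q in queries
        if all(counts[(q, genome)] == 1 for genome in subject_genomes)
    )


# Made-up hits against two subject genomes.
hits = [
    ("g1", "genomeA"), ("g1", "genomeB"),                      # one hit in each
    ("g2", "genomeA"), ("g2", "genomeA"), ("g2", "genomeB"),   # two hits in genomeA
    ("g3", "genomeA"),                                         # no hit in genomeB
]
print(one_hit_queries(hits, ["genomeA", "genomeB"]))
```

In this example only g1 qualifies: g2 has two hits in one genome and g3 is missing from the other.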
The other type of file contains the fasta sequences of the query and subject taxa for each marker. The EPIC markers are named using the gene ID plus a sequential number, since one gene may have more than one qualified EPIC marker. The position of the intron is indicated by "XXXXX". The length of the intron is the number appended at the end of each sequence. The CDS markers are named using the position on the genome plus the Ensembl gene ID. All fasta files are zipped into a folder, which can be downloaded and unzipped. The sequences can then be aligned and used for designing "universal" primers.
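When designing primers on the flanking exons, it can help to separate the two exon parts of an EPIC sequence at the "XXXXX" intron placeholder. The sketch below is one way to do this. It assumes the input string contains only the sequence residues with a single run of five or more X characters, and it does not parse the appended intron length, whose exact layout is not described here. The function name is hypothetical.

```python
import re


def split_at_intron(sequence):
    """Split an EPIC marker sequence into (upstream exon, downstream exon).

    The intron position is taken to be the first run of five or more X.
    """
    parts = re.split(r"X{5,}", sequence, maxsplit=1)
    if len(parts) != 2:
        raise ValueError("no intron placeholder (XXXXX) found in sequence")
    return parts[0], parts[1]


upstream, downstream = split_at_intron("ATGGCCAAGXXXXXGGTTACTAA")
print(upstream, downstream)
```

The two returned strings correspond to the conserved flanking exon regions in which primers would be placed.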