Cassava (Manihot esculenta, Euphorbiaceae), an allotetraploid known for its remarkable tolerance to abiotic stresses, is an important source of energy for humans and animals and a raw material for many industrial processes. It is therefore considered one of the most useful starch crops and is expected to serve both as a food source and as an industrial energy source. There is a growing need for information on cassava, such as published literature and sequences registered in public databanks.
The Cassava Online Archive provides the cassava mRNA sequences and ESTs currently available from NCBI (GenBank/EMBL/DDBJ), together with their annotations. At present, the database can be searched by gene function, accession number, and sequence similarity (BLAST). Because the cassava genome sequence dataset is not yet available, the annotations in the Cassava Online Archive are based on similarity search results collated from several protein databases and on the results of mapping to the castor bean (Ricinus communis, Euphorbiaceae), poplar (Populus trichocarpa, Salicaceae), grape (Vitis vinifera, Vitaceae), and Arabidopsis thaliana (Brassicaceae) genome sequences. To improve the annotations, the database also includes domain organization as predicted by InterProScan and Gene Ontology (GO) terms.
We plan to expand the contents of the Cassava Online Archive by adding information on cassava biochemical pathways, genetically diverse types with useful traits, molecular markers that can be used for mapping, and other related data.
Expressed sequence tags (ESTs) and mRNA sequences of cassava available in the National Center for Biotechnology Information (NCBI) database were assembled with the CAP3 program, and redundant sequences were omitted from the database. The resulting non-redundant sequences were used as the query sequences for the annotation process of this database.
We ran CAP3 with the "-p 95 -z 1" options (relatively stringent assembly conditions) because the cassava transcripts available in the NCBI database were derived from various cassava varieties.
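The assembly step can be sketched as a small Python wrapper. This is only an illustration under stated assumptions, not the pipeline actually used for the archive: the option string "-p 95 -z 1" is copied verbatim from the text, while the function names, the input file name, and the log file are hypothetical. The sketch also assumes that a `cap3` executable is on the PATH and that it writes its contigs and singlets next to the input file with the `.cap.contigs` and `.cap.singlets` suffixes.

```python
import shutil
import subprocess
from pathlib import Path


def build_cap3_command(fasta_path, executable="cap3"):
    """Return the argument list for a CAP3 run with the options quoted in the text."""
    # "-p 95 -z 1" is taken verbatim from the description; everything else
    # (executable name, input path) is an assumption made for this sketch.
    return [executable, str(fasta_path), "-p", "95", "-z", "1"]


def run_cap3(fasta_path, executable="cap3"):
    """Run CAP3 if it is installed and return the contig and singlet file paths."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    # Hypothetical log file that captures the alignment report CAP3 prints to stdout.
    log_path = Path(f"{fasta_path}.cap.log")
    with log_path.open("w") as log:
        subprocess.run(build_cap3_command(fasta_path, executable), stdout=log, check=True)
    # Assumed CAP3 output naming: <input>.cap.contigs and <input>.cap.singlets.
    return Path(f"{fasta_path}.cap.contigs"), Path(f"{fasta_path}.cap.singlets")
```

Calling `run_cap3("cassava_transcripts.fasta")` (a hypothetical file name) would then yield the two files whose sequences together form the non-redundant query set.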
To predict the function of the genes encoding the cassava transcripts included in this database, we performed similarity searches against several datasets: (1) the green plant (Viridiplantae) mRNA sequences (not ESTs) in the NCBI database, as a typical representative of plant transcript datasets; (2) the green plant (Viridiplantae) protein sequences in the NCBI database (organized by The Arabidopsis Information Resource (TAIR)) and the UniProt/TrEMBL database of the European Bioinformatics Institute (EBI), as typical representatives of plant protein datasets; and (3) the protein sequences of castor bean (Ricinus communis; The Institute for Genomic Research (TIGR)), poplar (Populus trichocarpa; Joint Genome Institute (JGI)), grape (Vitis vinifera; International Grape Genome Program (IGGP)), and Arabidopsis thaliana (TAIR), derived from the predicted coding sequences (CDS) and captured full-length cDNAs, as the annotated dicot-plant protein dataset.
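How results from several reference datasets might be collated into one annotation per transcript can be sketched as follows. The text does not state which search tool or selection rule was used for annotation, so this example makes two assumptions: the hits are available in the common 12-column BLAST tabular layout, and the hit with the highest bit score is kept for each query. The function name and all identifiers in the usage example are hypothetical.

```python
import csv
import io

# Column order of the standard 12-column BLAST tabular output (assumed input format).
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text):
    """Return {query_id: (subject_id, bitscore)} keeping the top-scoring hit per query."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        record = dict(zip(BLAST_TAB_COLUMNS, row))
        bitscore = float(record["bitscore"])
        current = best.get(record["qseqid"])
        # Keep this hit only if the query has no hit yet or this one scores higher.
        if current is None or bitscore > current[1]:
            best[record["qseqid"]] = (record["sseqid"], bitscore)
    return best
```

For example, with made-up hits for two transcripts:

```python
sample = (
    "contig1\tAT1G01010\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t120\n"
    "contig1\tRc_0001\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "singlet7\tPt_0002\t70.0\t90\t27\t0\t1\t90\t5\t94\t1e-10\t60\n"
)
print(best_hits(sample))
```

Concatenating the tabular results from each reference dataset before calling `best_hits` gives a single best match across all of them; ranking by E-value instead would be an equally plausible rule.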
To obtain the gene structures of the cassava transcripts, we mapped them to the cassava, castor bean, poplar, grape, and Arabidopsis genome sequences. The exon-intron structures and associated genome annotation data are displayed using the Generic Genome Browser (GBrowse).
We developed a 60-mer oligonucleotide Agilent microarray comprising 21,522 probes designed from the transcript sequences (11,422 contigs and 18,214 singlets) in this database.