Choosing a reference database for Illumina sequencing results

Hi all,

I have generated a cell-specific polyA-tailed RNA pool and hope to profile the transcripts that are highly enriched in this cell type by Illumina sequencing.
The sequencing itself is being done by an outside company, but I am not sure which database would be the best reference to identify and qualify such cell-specific transcripts.
Would it be better to use the DNA database,
ftp://ftp.sanger.ac.uk/pub/wormbase/live_release/genomes/c_elegans/sequences/dna/c_elegans.WS224.dna.fa.gz
ftp://ftp.sanger.ac.uk/pub/wormbase/live_release/genomes/c_elegans/sequences/dna/c_elegans_masked.WS224.dna.fa.gz
ftp://ftp.sanger.ac.uk/pub/wormbase/live_release/genomes/c_elegans/sequences/dna/c_elegans_softmasked.WS224.dna.fa.gz
Or RNA database,
ftp://ftp.sanger.ac.uk/pub/wormbase/live_release/genomes/c_elegans/sequences/rna/c_elegans.WS224.rna.fa.gz
Or GFF file,
ftp://ftp.sanger.ac.uk/pub/wormbase/live_release/genomes/c_elegans/genome_feature_tables/SUPPLEMENTARY_GFF/mSplicer_transcript.gff

Any suggestion is appreciated~~

Thank you…

The RNA database is our collection of non-coding RNAs such as 21U and others. There is also ftp://ftp.sanger.ac.uk/pub2/wormbase/live_release/genomes/c_elegans/sequences/protein/wormpep.dna224
which contains only the CDSes.

Another option: in WormMart you can dump all transcripts as a FASTA file.

I would take a staggered approach:
a.) first check against the collected transcripts (coding + non-coding) (that should cover all mature transcripts)
b.) then search any reads that can’t be found in a.) against the genome (that should cover unspliced transcripts)
c.) grab the collected C. elegans ESTs from DDBJ/GenBank/ENA in case it is a partially spliced sequence (in the hope that a random partially spliced EST matches yours)
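
The staggered approach above can be sketched in a few lines of Python. This is only an illustration of the control flow, not a real pipeline: the `exact_substring_searcher` helper is a toy stand-in for whatever aligner you actually use, and all sequences and IDs (`tx1`, `chrI`, `est1`, `r1`...) are made up for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A "searcher" takes a read sequence and returns the ID of a hit, or None.
Searcher = Callable[[str], Optional[str]]


def exact_substring_searcher(references: Dict[str, str]) -> Searcher:
    """Toy stand-in for a real aligner: reports a hit only on an exact substring match."""
    def search(seq: str) -> Optional[str]:
        for ref_id, ref_seq in references.items():
            if seq in ref_seq:
                return ref_id
        return None
    return search


def staggered_search(
    reads: Dict[str, str],
    tiers: List[Tuple[str, Searcher]],
) -> Dict[str, Tuple[str, str]]:
    """Try each tier in order; only reads without a hit are passed to the next tier."""
    assigned: Dict[str, Tuple[str, str]] = {}
    remaining = dict(reads)
    for tier_name, search in tiers:
        still_unmatched: Dict[str, str] = {}
        for read_id, seq in remaining.items():
            hit = search(seq)
            if hit is None:
                still_unmatched[read_id] = seq
            else:
                assigned[read_id] = (tier_name, hit)
        remaining = still_unmatched
    # Anything left after the last tier is reported as unmatched.
    for read_id in remaining:
        assigned[read_id] = ("unmatched", "")
    return assigned


# Made-up toy data: one spliced transcript, its genomic locus (with an intron), one EST.
transcripts = {"tx1": "ATGGCCAAGTTT"}
genome = {"chrI": "ATGGCCGTAAGTCCCAGAAGTTT"}
ests = {"est1": "GGGTTTAAACCC"}

reads = {
    "r1": "GCCAAG",  # crosses the splice junction, so only the transcript matches
    "r2": "CCGTAA",  # runs into the intron, so only the genome matches
    "r3": "TTTAAA",  # found only in the EST
    "r4": "CGCGCG",  # found nowhere
}

result = staggered_search(
    reads,
    [
        ("transcripts", exact_substring_searcher(transcripts)),  # step a.)
        ("genome", exact_substring_searcher(genome)),            # step b.)
        ("ests", exact_substring_searcher(ests)),                # step c.)
    ],
)

for read_id, (tier, hit) in result.items():
    print(read_id, tier, hit)
```

With real data you would replace the toy searcher with calls to your aligner and pass only the unaligned reads from one step on to the next.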

Also, I would use either WS220 or WS225: if you ever want to publish your results and people want to reproduce them, the files will at least still be around as a frozen release, and you can put in a reference (e.g. "searched against WS220 WormBase transcripts"). Otherwise I would add the files as supplementary data to the paper.

Aligning to an RNA reference is more straightforward computationally, but be aware that the presence of splicing isoforms in the reference precludes unique assignment of reads that map to exons present in more than one transcript. Unless you’re counting isoforms, I would reverse the analysis order: genome first (an ungapped aligner will map all the single-exon and unspliced reads), then the RNA reference (for reads that cross splice junctions). Alternatively, you can use the Tuxedo package (Bowtie, TopHat, & Cufflinks) with the reference genome.
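
To illustrate the isoform ambiguity, here is a small Python sketch. It assumes you already have, for each read, the set of transcripts it aligned to, plus a transcript-to-gene table; the IDs below (`geneA.a`, `geneA.b`, `geneB.a`) are hypothetical and not real WormBase names. Reads hitting several isoforms of one gene cannot be assigned to a single transcript, but they can still be counted at the gene level.

```python
from collections import defaultdict
from typing import Dict, Set


def classify_reads(
    read_hits: Dict[str, Set[str]],
    transcript_to_gene: Dict[str, str],
) -> Dict[str, str]:
    """Label each read by how uniquely it can be assigned."""
    classes: Dict[str, str] = {}
    for read_id, hit_transcripts in read_hits.items():
        genes = {transcript_to_gene[t] for t in hit_transcripts}
        if not hit_transcripts:
            classes[read_id] = "unmapped"
        elif len(hit_transcripts) == 1:
            classes[read_id] = "unique_isoform"
        elif len(genes) == 1:
            # Shared exon: several isoforms, but all from the same gene.
            classes[read_id] = "ambiguous_isoform_unique_gene"
        else:
            classes[read_id] = "multi_gene"
    return classes


def gene_counts(
    read_hits: Dict[str, Set[str]],
    transcript_to_gene: Dict[str, str],
) -> Dict[str, int]:
    """Count reads per gene, keeping only reads whose hits all belong to one gene."""
    counts: Dict[str, int] = defaultdict(int)
    for hit_transcripts in read_hits.values():
        genes = {transcript_to_gene[t] for t in hit_transcripts}
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
    return dict(counts)


# Hypothetical example: geneA has two isoforms that share an exon.
transcript_to_gene = {"geneA.a": "geneA", "geneA.b": "geneA", "geneB.a": "geneB"}
read_hits = {
    "r1": {"geneA.a"},             # exon specific to one isoform
    "r2": {"geneA.a", "geneA.b"},  # exon shared by both geneA isoforms
    "r3": {"geneA.b", "geneB.a"},  # maps to two different genes
    "r4": {"geneB.a"},
}

classes = classify_reads(read_hits, transcript_to_gene)
counts = gene_counts(read_hits, transcript_to_gene)
print(classes)
print(counts)
```

If you only need gene-level enrichment, collapsing to genes like this sidesteps most of the isoform problem; if you need isoform-level numbers, a tool that models the ambiguity (such as Cufflinks) is the better route.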

Harold

Thanks so much for your suggestions. ;D

I think I could do the following:

  1. choose a frozen release as the reference: ftp://ftp.sanger.ac.uk/pub/wormbase/FROZEN_RELEASES/
  2. align against the genome sequence first to place most of the reads, and then against the RNA reference database for more annotation information.