The value of transcript-based curation - a discussion point

Hi,

It seems that WormBase curators are spending more and more effort chasing transcript evidence, with diminishing returns for the time spent. We do not seem to be discovering new worm genes at any significant rate; rather, there is an ever-upward trend of adding rarer and rarer isoforms to existing (protein-coding) genes.

The rationale for following transcript evidence is clearly a good one and is common in other annotation pipelines (e.g. Ensembl). However, I feel you get to the point where you are essentially struggling to fit evidence from one EST into a new protein-coding isoform: an isoform that might be real, but might only occur once in a hundred times, and may even represent an accidental splicing event for which an EST was ‘captured’ before Nonsense Mediated Decay processes had time to clear up the errant transcript. As a result of this type of curation, WormBase now contains many isoforms which are supported by only one or two ESTs and which use very poor (i.e. unlikely) splice donor/acceptor sequences. My favourite example of this is rpl-22 (see image below). This gene has 2 ESTs that support a rare isoform which leads to a truncated protein (presumably non-functional, given the nature of the gene), but there are over a hundred ESTs supporting the dominant isoform. Furthermore, the splice acceptor of the shorter isoform is a terrible match to the known splice acceptor consensus.

There is increasing evidence (mostly from microarray experiments) that much of a genome might be transcribed, but not necessarily to encode proteins. E.g., see:
http://www.nature.com/nature/journal/v418/n6894/full/418122a.html
http://www.cs.huji.ac.il/csls2003/seminar/saar%20dark%20matter%20in%20the%20genome%20-%20jason%20m%20sohnson%20et%20al.pdf

So my concerns are two-fold. Firstly, there may be more productive/useful avenues for curation to explore than always assuming that every EST must belong to a protein-coding gene. Secondly, when there are multiple isoforms in WormBase, no attempt is made to distinguish what is clearly the major isoform from what is possibly a rare transcript leading to a possibly non-functional protein. Of course I’m assuming that the transcripts are all real, and have been mapped to their correct locations by BLAT… both these assumptions are likely to be questionable for some ESTs.

I know that most people within WormBase already know my views on this, but I welcome discussion with others in the community. Do others out there also feel that some transcription is ‘leaky’ and may not be biologically meaningful?

Regards,

Keith

http://vab.wormbase.org/db/seq/gbrowse_img/wormbase?name=II:5055414..5056317;type=LOCI%3Aoverview+CG+OP+ESTB+mRNAB+OST+file3Asearch_results;width=800;id=47c12437f213e117fbee07abb1b4a5ac;options=ESTB+2;h_feat=Y41E3.10@yellow

Well, I’m not sure they are chasing… There are still plenty of things that need to be cleaned up, and the new batch of ESTs clearly helped with that.
I don’t think they are particularly chasing “rare” isoforms.

The rationale for following transcript evidence is clearly a good one and is common in other annotation pipelines (e.g. Ensembl). However, I feel you get to the point where you are essentially struggling to fit evidence from one EST into a new protein-coding isoform: an isoform that might be real, but might only occur once in a hundred times, and may even represent an accidental splicing event for which an EST was 'captured' before Nonsense Mediated Decay processes had time to clear up the errant transcript. As a result of this type of curation, WormBase now contains many isoforms which are supported by only one or two ESTs and which use very poor (i.e. unlikely) splice donor/acceptor sequences. My favourite example of this is rpl-22 (see image below). This gene has 2 ESTs that support a rare isoform which leads to a truncated protein (presumably non-functional, given the nature of the gene), but there are over a hundred ESTs supporting the dominant isoform. Furthermore, the splice acceptor of the shorter isoform is a terrible match to the known splice acceptor consensus.

I think this is quite natural: biological processes do make mistakes, and that will be reflected in the ESTs. However, some unusual forms do exist, like the examples posted here in this section, and in some cases rare isoforms are of fundamental importance. The two are not necessarily easy to tell apart.
However, I think it would be very useful for ORF isoforms in WormBase that are based on ESTs to carry some kind of indicator of whether they are likely or unlikely isoforms, based on EST frequencies and perhaps also taking note of very poor splice sites (which may not be poor in vivo, if there is a special splice factor that favours that version).
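To make the indicator idea concrete, here is a minimal sketch in Python of how isoforms might be labelled from EST frequencies alone. Everything in it is an assumption for illustration: the `Isoform` record, the `label_isoforms` function, the two thresholds and the isoform names are hypothetical, the EST counts are stand-ins loosely based on the rpl-22 example (2 ESTs versus roughly a hundred), and splice-site quality is not considered at all.

```python
from dataclasses import dataclass


@dataclass
class Isoform:
    name: str        # hypothetical isoform identifier
    est_count: int   # number of ESTs supporting this isoform


def label_isoforms(isoforms, min_fraction=0.05, min_ests=3):
    """Label each isoform 'likely' or 'unlikely' from its share of the gene's ESTs.

    The thresholds are arbitrary illustrative values, not WormBase policy.
    """
    total = sum(iso.est_count for iso in isoforms)
    labels = {}
    for iso in isoforms:
        fraction = iso.est_count / total if total else 0.0
        well_supported = iso.est_count >= min_ests and fraction >= min_fraction
        labels[iso.name] = "likely" if well_supported else "unlikely"
    return labels


# Illustrative counts only: a dominant model and a rarely seen truncated model.
example_gene = [Isoform("dominant_model", 100), Isoform("truncated_model", 2)]
print(label_isoforms(example_gene))
```

A label like this would only be a hint: as noted above, a rare isoform with a poor-looking splice site can still be real and important, so a flag of "unlikely" should not be read as "wrong".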

So my concerns are two-fold. Firstly that there may be more productive/useful avenues to explore curation, rather than always assume that [b]every[/b] EST [b]must[/b] belong to a protein-coding gene, and secondly that when there are cases of multiple isoforms in WormBase, no attempt is made to clarify or distinguish what is clearly the major isoform and what is possibly a rare transcript, leading to a possible non-functioning protein. Of course I'm assuming that the transcripts are all real, and have been mapped to their correct locations by BLAT...both these assumptions are likely to be questionable for some ESTs.

It may be that most users of WormBase are aware of the limitations, and draw the right conclusions.

I know that most people within WormBase already know my views on this, but I welcome discussion with others in the community. Do others out there also feel that some transcription is 'leaky' and may not be biologically meaningful?

As I said, most likely every transcript occasionally fails to be spliced properly, so it’s expected that a fraction of ESTs have some errors.

Overall though, I think it would be worthwhile to have some kind of indication of which ones are the “good” transcripts and which ones could be aberrant (in the absence of other confirmatory evidence).
There is a very good reason for this: many bioinformatic analyses rely on the annotations, and they extract promoter regions etc. automatically. For these programs it would be essential to be able to throw out the unlikely variants.

thomas

I have met many grad students here at UC Davis (and elsewhere) who sometimes seem overly trusting of gene predictions, just because they are in WormBase. One issue about this is that if you only ever look at the Gene Summary page you don’t really see the differences between isoforms with respect to the numbers of matching ESTs. All isoforms are presented as equally likely.

Overall though, I think it would be worthwhile to have some kind of indication of which ones are the "good" transcripts and which ones could be aberrant (in the absence of other confirmatory evidence). There is a very good reason for this: many bioinformatic analyses rely on the annotations, and they extract promoter regions etc. automatically. For these programs it would be essential to be able to throw out the unlikely variants.

thomas

Regarding the issue of bioinformaticians relying on the annotation, I’ve seen many papers where the authors would say something like “for many genes, multiple isoforms are available and so we chose the longest isoform only”…I guess at some point in future I would like to see people saying ‘we chose the isoform with the most abundant transcript evidence’, but that is only likely to happen if it is clearer (to the community) with respect to the relative expression of different isoforms.
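The difference between the two selection rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`name`, `length`, `est_count`), both function names and the example numbers are hypothetical, and EST counts are used as a crude proxy for relative expression.

```python
def longest_isoform(isoforms):
    """The common shortcut: pick the isoform with the longest sequence."""
    return max(isoforms, key=lambda iso: iso["length"])


def best_supported_isoform(isoforms):
    """Pick the isoform with the most supporting ESTs, breaking ties by length."""
    return max(isoforms, key=lambda iso: (iso["est_count"], iso["length"]))


# Hypothetical gene where the longest model is also the rarely observed one.
gene = [
    {"name": "model_a", "length": 1200, "est_count": 2},
    {"name": "model_b", "length": 900, "est_count": 150},
]

print(longest_isoform(gene)["name"])         # model_a
print(best_supported_isoform(gene)["name"])  # model_b
```

When the two rules disagree, as here, an automated analysis that takes the longest isoform would be built on the model with the least transcript support.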

Keith

I have met many grad students here at UC Davis (and elsewhere) who sometimes seem overly trusting of gene predictions, just because they are in WormBase. One issue about this is that if you only ever look at the Gene Summary page you don't really see the differences between isoforms with respect to the numbers of matching ESTs. All isoforms are presented as equally likely.

Ah, yes, something I teach in courses: never accept things in a database as the absolute truth. There can be many levels of errors, which keep propagating endlessly.
What I sort of meant to say before is that when people work on a gene, they really look at all the data carefully, do experiments etc., so they would notice if some transcripts are unlikely variants. But you are right, not everybody may be careful enough.

Regarding the issue of bioinformaticians relying on the annotation, I've seen many papers where the authors would say something like "for many genes, multiple isoforms are available and so we chose the longest isoform only"...I guess at some point in future I would like to see people saying 'we chose the isoform with the most abundant transcript evidence', but that is only likely to happen if it is clearer (to the community) with respect to the relative expression of different isoforms.

Yes, I think some kind of tag would be useful to identify the major transcript.
I presume you tried to convince the WormBasers, but with no luck. It’s perhaps also a question of how many transcripts are really affected by such problems. Do we have any idea? 1%? 5%?

Thomas

I agree with you. Lots of trans-splicing mistakes, with SL1 or SL2 at inappropriate cis 3’ splice sites, have been caught by cDNAs, so I see no reason why a lot of cis-splicing mistakes or failures would not have happened as well. I’m guessing that a fairly high percentage of atypical ESTs represent these frozen mistakes, and that they have no functional consequences. On the other hand, there are certainly some that represent real isoforms, so it isn’t clear to me how these should be approached. Perhaps for most genes the major gene model should be all that’s shown, while gene models represented by rare splicing events could be a clickable track.