The worst genes in WormBase!

kbradnam · December 6, 2006, 11:24pm

Ok, so the title is deliberately provocative but what I want to address is the number of genes in WormBase which don’t look very believable but which will never be removed (under current curation guidelines).

About 20% of all WormBase genes have no transcript evidence at all, and within this set are some that I personally do not believe. For about half of them there is actually evidence against them being real: over 2,000 genes could not be amplified by the ORFeome project and also have no ortholog predicted or detected in C. briggsae. Many of this subset also have no DNA matches to C. briggsae at all (deduced from the lack of WABA alignments).

One example of this is 6R55.2 (http://www.wormbase.org/db/gene/gene?name=WBGene00007069;class=Gene). This short, three-exon gene fulfils the above criteria and, additionally, its first exon is wholly contained within a repeat. The ‘evidence’ for this gene stems from its initial genefinder prediction (which is not really evidence) and subsequent protein homology (from BLASTP hits). The best BLASTP hit is to a partial fragment of a whole genome shotgun sequence in a fish, and UniProt has nothing to say at all about this protein. Furthermore, the highest scoring match to the fish protein is actually another C. elegans gene.

Another example is AH10.4 (http://www.wormbase.org/db/gene/gene?name=WBGene00007085;class=Gene). This very short gene is in the intron of another gene on the same strand. There are no abnormal RNAi phenotypes (another trait in common with many of these genes) and the best non-worm match covers just 32 amino acids and has a very low score.

So I believe that there will always be ‘evidence’ from BLAST hits, but not evidence that I find believable. Some of these gene predictions have been around for nearly a decade now and still have no good evidence for their existence. My suggestion would be to remove them from the canonical gene set, remove their proteins from WormPep and maybe introduce a retired/spurious category so that they could appear in a separate track in the genome browser (and maybe exist in a separate BLAST database). If they are real, then people who work on these genes will probably be swift to let WormBase know!

At some level I think there should be some (evidence-based) justification for every worm gene, or at least a visible scoring system which allows people to say ‘yep, this is the worst scoring worm gene’.
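A minimal sketch of what such a scoring system could look like, in Python. The evidence fields mirror the criteria discussed above (transcript support, ORFeome amplification, C. briggsae ortholog, WABA alignment, RNAi phenotype, repeat overlap), but the `GeneEvidence` class, the weights and the example genes are all hypothetical and not anything that exists in WormBase.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene evidence record (field names are assumptions)."""
    name: str
    has_transcript: bool      # any transcript evidence at all
    orfeome_amplified: bool   # amplified by the ORFeome project
    briggsae_ortholog: bool   # ortholog predicted/detected in C. briggsae
    waba_alignment: bool      # any WABA DNA alignment to C. briggsae
    rnai_phenotype: bool      # any abnormal RNAi phenotype
    overlaps_repeat: bool     # an exon lies wholly within a repeat


# Illustrative weights only; real values would need to be agreed by curators.
WEIGHTS = {
    "has_transcript": 3,
    "orfeome_amplified": 2,
    "briggsae_ortholog": 2,
    "waba_alignment": 1,
    "rnai_phenotype": 1,
}
REPEAT_PENALTY = 1


def evidence_score(gene: GeneEvidence) -> int:
    """Sum the weights of the evidence types present, minus a repeat penalty."""
    score = sum(w for field, w in WEIGHTS.items() if getattr(gene, field))
    if gene.overlaps_repeat:
        score -= REPEAT_PENALTY
    return score


def worst_genes(genes, n=10):
    """Return the n lowest-scoring genes, worst first."""
    return sorted(genes, key=evidence_score)[:n]


# Made-up example genes, not real WormBase data.
example_genes = [
    GeneEvidence("geneA", True, True, True, True, True, False),
    GeneEvidence("geneB", False, False, False, False, False, True),
]
for g in worst_genes(example_genes):
    print(g.name, evidence_score(g))
```

Keeping the weights in one visible table is the point of the exercise: anyone can see why a given gene scores badly and argue about the weighting rather than about individual genes.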

Regards,

Keith

gw3 · May 19, 2008, 3:11pm

I agree AH10.4 looks quite unlikely and is a good candidate for being made into a pseudogene,
but AH10.4 has an excellent paralog in C. elegans (which may be its original copy, if it is a pseudogene):

```
AH10.4   1 MHEGPLEIRFLKLFDFPDFCPSKELQELQQRKKRTCCTTEQFLILKFLIF 50
           ||||||||||||||:|||||||||||||||||||||||||||||||||||
ZK84.4   1 MHEGPLEIRFLKLFNFPDFCPSKELQELQQRKKRTCCTTEQFLILKFLIF 50

AH10.4  51 LKLFTLIKLNLNIPSIPHCQCI 72
           ||.||||||||.||||||||||
ZK84.4  51 LKFFTLIKLNLIIPSIPHCQCI 72
```
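To put a number on how close the two proteins are, here is a small Python sketch that counts identical positions. It assumes the alignment is gapless and full length, as shown above, so the two sequences can be compared position by position; the function name `percent_identity` is just for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of identical positions in two equal-length, gapless sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (gapless alignment)")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Sequences copied from the alignment above.
ah10_4 = (
    "MHEGPLEIRFLKLFDFPDFCPSKELQELQQRKKRTCCTTEQFLILKFLIF"
    "LKLFTLIKLNLNIPSIPHCQCI"
)
zk84_4 = (
    "MHEGPLEIRFLKLFNFPDFCPSKELQELQQRKKRTCCTTEQFLILKFLIF"
    "LKFFTLIKLNLIIPSIPHCQCI"
)

mismatches = [
    (i + 1, a, b) for i, (a, b) in enumerate(zip(ah10_4, zk84_4)) if a != b
]
print(f"{percent_identity(ah10_4, zk84_4):.1f}% identical")
print("differences (position, AH10.4, ZK84.4):", mismatches)
```

The three differing positions correspond to the `:` and `.` characters in the alignment's match line.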

Gary

jspieth · May 19, 2008, 3:22pm

Hi Keith,

Do you have a more extensive list you would like us to look over, or are you suggesting we look at all of the 20% of the genes that have no transcript evidence?

thanks,

John

kbradnam · May 19, 2008, 5:57pm

You’re replying to a post from 1.5 years ago, so I don’t think I have any list now. I think working through a subset of predicted genes might be best (i.e. those with the least supporting evidence). If there was an evidence score for each transcript, then that might be a lot easier.

Of the two examples that I raised, I notice that 6R55.2 now has an attached RACE sequence. I don’t know too much about RACE technology, but does the location of the attached 5’ sequence mean that the 5’ end of the annotated gene is wrong?

Regards,

Keith

jspieth · May 19, 2008, 6:30pm

Gary’s reply from earlier today is what caught my attention. Regardless, I think this is an issue that can always stand scrutiny, so please send us other dubious genes as you come across them and we’ll take a look at them. Of course, no evidence is not necessarily evidence for removing a gene. The two examples you gave in your original post have since acquired data indicating that they should probably be annotated at some level.

Regarding your 5’ RACE question: I think it depends on how the RACE experiment was done, but it seems likely the 5’ end is wrong in this case.

Gary, does the curation tool look at RACE tags? If not, maybe it should.

thanks,

John

gw3 · May 20, 2008, 8:34am

John,

No - the curation tool doesn’t check for RACE tags being at the ends of the CDS and it certainly should!

I’ll do this today.
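As a rough illustration of the kind of check meant here (not the curation tool's actual code), the Python sketch below flags 5’ RACE ends that map inside an annotated CDS rather than at or upstream of its 5’ end. The `CDS` class, the coordinate convention (1-based, inclusive, `start` < `end` regardless of strand) and the example coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CDS:
    """Hypothetical CDS record; start/end are genomic, 1-based, inclusive."""
    name: str
    start: int   # lowest genomic coordinate
    end: int     # highest genomic coordinate
    strand: str  # "+" or "-"


def five_prime_end(cds: CDS) -> int:
    """Genomic coordinate of the annotated 5' end of the CDS."""
    return cds.start if cds.strand == "+" else cds.end


def race_conflicts(cds: CDS, race_5p_ends):
    """Return 5' RACE end positions that fall inside the CDS.

    A tag exactly at the annotated 5' end, or upstream of it (e.g. in a
    5' UTR), is not reported. A tag inside the CDS suggests the annotated
    5' end may be wrong and the gene model should be looked at by hand.
    """
    annotated = five_prime_end(cds)
    return [
        pos for pos in race_5p_ends
        if cds.start <= pos <= cds.end and pos != annotated
    ]


# Made-up coordinates, not taken from any real gene model.
cds = CDS("hypothetical_gene", 1000, 2000, "+")
print(race_conflicts(cds, [950, 1000, 1200]))
```

Whether an internal tag really means the model is wrong depends on how the RACE experiment was done, so a check like this would only produce candidates for manual review.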

Gary