3 Chimaeric ESTs Puzzle

There seems to be something odd happening with the three ESTs:
yk1279h06.5
yk1397e04.5
yk1518a07.5

The first 275 or so bases of these ESTs align perfectly to the first two exons of Y92H12BL.4 while the part after approximately base 275 aligns perfectly to the first two exons of Y92H12BL.1.

After a BLAT alignment, about 53% of the length of these ESTs matches the genome at Y92H12BL.1 but only about 43% matches the genome at Y92H12BL.4, which is why the best match for each EST is displayed on the genome viewer at the 5’ end of Y92H12BL.1.
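To make the "best match wins" behaviour concrete, here is a minimal Python sketch of how per-locus coverage could be compared and how an EST split between two loci might be flagged. It is not the viewer's actual code. The EST length (640) and the block coordinates are hypothetical values chosen only to reproduce the 53% and 43% figures above, and the thresholds in `looks_chimaeric` are arbitrary.

```python
from typing import Dict, List, Tuple

Block = Tuple[int, int]  # 1-based inclusive (start, end) on the EST


def covered_bases(blocks: List[Block]) -> int:
    """Count EST bases covered by at least one block (overlaps counted once)."""
    total = 0
    current_end = 0
    for start, end in sorted(blocks):
        start = max(start, current_end + 1)  # skip bases already counted
        if end >= start:
            total += end - start + 1
            current_end = end
    return total


def coverage_by_locus(est_length: int, hits: Dict[str, List[Block]]) -> Dict[str, float]:
    """Fraction of the EST length aligned at each locus."""
    return {locus: covered_bases(blocks) / est_length for locus, blocks in hits.items()}


def best_locus(coverage: Dict[str, float]) -> str:
    """Locus with the largest aligned fraction (the one a viewer would display)."""
    return max(coverage, key=coverage.get)


def looks_chimaeric(est_length: int, hits: Dict[str, List[Block]],
                    min_each: float = 0.25, min_combined: float = 0.9,
                    max_overlap: float = 0.05) -> bool:
    """Flag an EST whose halves align to different loci with little overlap."""
    cov = coverage_by_locus(est_length, hits)
    strong = [locus for locus, frac in cov.items() if frac >= min_each]
    if len(strong) < 2:
        return False
    combined = covered_bases([b for locus in strong for b in hits[locus]]) / est_length
    overlap = sum(cov[locus] for locus in strong) - combined
    return combined >= min_combined and overlap <= max_overlap


# Hypothetical numbers: a 640-base EST split at about base 275.
EST_LENGTH = 640
HITS = {"Y92H12BL.4": [(1, 275)], "Y92H12BL.1": [(276, 614)]}

COVERAGE = coverage_by_locus(EST_LENGTH, HITS)
print(COVERAGE, best_locus(COVERAGE), looks_chimaeric(EST_LENGTH, HITS))
```

With these made-up coordinates the larger fraction falls on Y92H12BL.1, so that is where the EST would be drawn, even though the two hits together account for nearly all of the read.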

This is not the first EST I have seen that appears to be a chimaera of two nearby gene transcripts, but it is the first time that I have seen it happen three times nearly identically.

It is always possible that the same chimaeric transcript was sequenced three times under different names in error.

There is a large ORF across the full length of these ESTs giving a protein product which has good matches to the first half of “CDK5 regulatory subunit associated protein 1-like” in vertebrates.

`

gi|45361233|ref|NP_989194.1| CDK5 regulatory subunit associated protein 1-like 1 [Xenopus
tropicalis]
gi|38649005|gb|AAH63205.1| CDK5 regulatory subunit associated protein 1-like 1 [Xenopus
tropicalis]
Length=553

Score = 184 bits (467), Expect = 2e-45, Method: Composition-based stats.
Identities = 94/198 (47%), Positives = 135/198 (68%), Gaps = 11/198 (5%)

Query 6 DIEDIVGR---GPVGSRDANE-IKIRTRKQVPKEQQPDDANVDSMVPGVGQKVWVRTWGC 61
DIEDIV P ++A + I R RK+ + Q ++ DS +PG QK+W+RTWGC
Sbjct 11 DIEDIVSATDPKPHDRQNARQNIVPRARKRNKNKIQEEEPPADSTIPGT-QKIWIRTWGC 69

Query 62 SHNTSDSEYMSGLLQQAGYDVVKEPETAQVWVLNSCTVKTPSEQQANNLVVQGQEQGKKI 121
SHN SD EYM+G L GY + ++PE A +W+LNSCTVK+P+E N + + QE KK+
Sbjct 70 SHNNSDGEYMAGQLAAYGYSITEQPEQADLWLLNSCTVKSPAEDHFRNSIKKAQEANKKV 129

Query 122 IMAGCVSQAAPSEPWLQNVSIVGVKQIDRIVEVVGETLKGNKVRLLTRNRPD------AV 175
+++GCV QA P + +++ +SI+GV+QIDR+VEVV ET+KG+ VRLL + + + A
Sbjct 130 VLSGCVPQAQPRQEYMKGLSIIGVQQIDRVVEVVEETIKGHSVRLLGQKKDNGKRLGGAR 189

Query 176 LSLPKMRKNELIEVLSIS 193
L LPK+RKN LIE++SI+
Sbjct 190 LDLPKIRKNPLIEIISIN 207
`

The Y92H12BL.1 protein product matches the “CDK5 regulatory subunit associated protein 1-like” proteins of many species very well:

`
gi|39598390|emb|CAE69083.1| Hypothetical protein CBG15101 [Caenorhabditis briggsae]
Length=437

Score = 694 bits (1792), Expect = 0.0, Method: Composition-based stats.
Identities = 333/356 (93%), Positives = 345/356 (96%), Gaps = 0/356 (0%)

Query 1 MAGCVSQAAPSEPWLQNVSIVGVKQIDRIVEVVGETLKGNKVRLLTRNRPDAVLSLPKMR 60
MAGCVSQAAPSEPWLQNVSIVGVKQIDRIVEVV ETLKGNKVRLLTRNRPDA+LSLPKMR
Sbjct 80 MAGCVSQAAPSEPWLQNVSIVGVKQIDRIVEVVEETLKGNKVRLLTRNRPDALLSLPKMR 139

Query 61 KNELIEVLSISTGCLNNCTYCKTKMARGDLVSYPLADLVEQARAAFHDEGVKELWLTSED 120
KNELIEVLSISTGCLNNCTYCKTKMARGDLVSYPL DLVEQARAAFHDEGVKELWLTSED
Sbjct 140 KNELIEVLSISTGCLNNCTYCKTKMARGDLVSYPLEDLVEQARAAFHDEGVKELWLTSED 199

Query 121 LGAWGRDIGLVLPDLLRELVKVIPDGSMMRLGMTNPPYILDHLEEIAEILNHPKVYAFLH 180
LGAWGRDI LVLPDLL LVKVIPDG MMRLGMTNPPYILDHLEEIAEILN+PKVYAFLH
Sbjct 200 LGAWGRDINLVLPDLLNALVKVIPDGCMMRLGMTNPPYILDHLEEIAEILNNPKVYAFLH 259

Query 181 IPVQSASDAVLNDMKREYSRRHFEQIADYMIANVPNIYIATDMILAFPTETLEDFEESME 240
IPVQSASDAVL DMKREYSRRHFEQIADYMI NVPNIYIATDMILAFPTETLEDFEESME
Sbjct 260 IPVQSASDAVLTDMKREYSRRHFEQIADYMIENVPNIYIATDMILAFPTETLEDFEESME 319

Query 241 LVRKYKFPSLFINQYYPRSGTPAARLKKIDTVEARKRTAAMSELFRSYTRYTDERIGELH 300
LVRKYKFPSLFINQYYPRSGTPAARLKKIDTVEARKRTAAMSELFRSYTR+T++RIGE+H
Sbjct 320 LVRKYKFPSLFINQYYPRSGTPAARLKKIDTVEARKRTAAMSELFRSYTRFTEDRIGEIH 379

Query 301 RVLVTEVAADKLHGVGHNKSYEQILVPLEYCKMGEWIEVRVTAVTKFSMISKPASI 356
VLVTE+AADK+HGVGHNKSYEQILVPLE+CKMGEWIEVR+T+VTKFSMIS P S+
Sbjct 380 NVLVTEIAADKIHGVGHNKSYEQILVPLEHCKMGEWIEVRITSVTKFSMISTPTSL 435
`

while the Y92H12BL.4 protein product appears to be a merger of two types of protein:

`
gi|39598389|emb|CAE69082.1| Hypothetical protein CBG15100 [Caenorhabditis briggsae]
Length=646

Score = 202 bits (514), Expect = 1e-50, Method: Composition-based stats.
Identities = 119/132 (90%), Positives = 124/132 (93%), Gaps = 3/132 (2%)

Query 131 NEKK---RRGWELLAIVLAFIFPTEAINEKLNEFLNKHLDPIFDLPEVSTSYFSQQCIKR 187
NEK RRGWELL IVLAFIFPTEAI+EKLN+FLNKHLD IFDLPEVSTSYFSQQC+KR
Sbjct 365 NEKPDSLRRGWELLTIVLAFIFPTEAISEKLNDFLNKHLDSIFDLPEVSTSYFSQQCLKR 424

Query 188 LSKVIARLKPSLSTIQETKIHIFRPPLFSASLEELMQMQSEKFpelklpwllttliellY 247
LSKVI RLKPSL +IQETKIHIFRPPLFSASLEELMQMQSEKFPELKLPWLLTTLIELLY
Sbjct 425 LSKVITRLKPSLQSIQETKIHIFRPPLFSASLEELMQMQSEKFPELKLPWLLTTLIELLY 484

Query 248 QSGGRRTEGLFR 259
QSGGRRTEG+FR
Sbjct 485 QSGGRRTEGIFR 496
`

and

`

gi|39598390|emb|CAE69083.1| Hypothetical protein CBG15101 [Caenorhabditis briggsae]
Length=437

Score = 88.2 bits (217), Expect = 3e-16, Method: Composition-based stats.
Identities = 37/41 (90%), Positives = 40/41 (97%), Gaps = 0/41 (0%)

Query 40 DSMVPGVGQKVWVRTWGCSHNTSDSEYMSGLLQQAGYDVVK 80
DSMVPGVGQKVWVRTWGCSHNTSDSEYM+GLL +AGYDV+K
Sbjct 1 DSMVPGVGQKVWVRTWGCSHNTSDSEYMAGLLHKAGYDVLK 41
`

Y92H12BL.4 should almost certainly be split into two separate genes.

There is mass spectrometry evidence from:
MSP:EQQPDDANVDSMVPGVGQK
on the second exon of Y92H12BL.4 that the genomic region aligned to by the three ESTs produces a protein product.
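As a quick cross-check, the peptide can be looked up in the translated ORF of the ESTs. The sketch below assumes that the Query rows of the Xenopus BLAST report above are that translation, with alignment gaps written as hyphens. The helper names `ungap` and `find_peptide` are made up for this illustration.

```python
from typing import List, Optional, Tuple

# Query rows copied from the Xenopus tropicalis alignment above.
QUERY_ROWS = [
    "DIEDIVGR---GPVGSRDANE-IKIRTRKQVPKEQQPDDANVDSMVPGVGQKVWVRTWGC",
    "SHNTSDSEYMSGLLQQAGYDVVKEPETAQVWVLNSCTVKTPSEQQANNLVVQGQEQGKKI",
    "IMAGCVSQAAPSEPWLQNVSIVGVKQIDRIVEVVGETLKGNKVRLLTRNRPD------AV",
    "LSLPKMRKNELIEVLSIS",
]
QUERY_START = 6  # the BLAST report numbers the first aligned query residue as 6
PEPTIDE = "EQQPDDANVDSMVPGVGQK"


def ungap(rows: List[str]) -> str:
    """Join alignment rows and drop gap characters."""
    return "".join(rows).replace("-", "")


def find_peptide(protein: str, peptide: str,
                 first_residue: int = 1) -> Optional[Tuple[int, int]]:
    """Return 1-based (start, end) of the first exact match, or None."""
    idx = protein.find(peptide)
    if idx < 0:
        return None
    start = first_residue + idx
    return start, start + len(peptide) - 1


ORF_FRAGMENT = ungap(QUERY_ROWS)
print(find_peptide(ORF_FRAGMENT, PEPTIDE, QUERY_START))
```

An exact hit here only shows that the peptide is consistent with the EST reading frame. It does not by itself distinguish between the current Y92H12BL.4 model and a split gene.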

Gary

The two pieces of these cDNAs are joined at possible splice sites: the 5’ splice site on .4 is a lousy match to the consensus, but plausible I guess, and the 3’ splice site at the 5’ end of .1 is an excellent splice site. This could be (1) some kind of novel trans-splicing event, which would be very interesting; (2) an artifact of cDNA synthesis, which is made unlikely by the fact that the joins are at splice sites; or (3) an incorrect assembly. How likely is that last possibility?
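For anyone who wants to put a rough number on "lousy" versus "excellent", here is a small Python sketch that counts matches between a candidate site and a consensus string. The consensus strings used (GTRAGT for the start of the intron, TTTTCAG for its end) are assumptions based on commonly quoted C. elegans consensus sequences and should be checked before use. The example sites are invented, not the real junction sequences from these genes.

```python
# IUPAC codes used in the consensus strings below.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

# Assumed consensus strings; verify against a trusted source before relying on them.
DONOR_CONSENSUS = "GTRAGT"      # first six bases of the intron (5' splice site)
ACCEPTOR_CONSENSUS = "TTTTCAG"  # last seven bases of the intron (3' splice site)


def consensus_matches(site: str, consensus: str) -> int:
    """Count positions where the site agrees with the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base in IUPAC[code]
               for base, code in zip(site.upper(), consensus.upper()))


def match_fraction(site: str, consensus: str) -> float:
    """Fraction of consensus positions matched by the site."""
    return consensus_matches(site, consensus) / len(consensus)


# Invented example sites, for illustration only.
print(match_fraction("GTAAGT", DONOR_CONSENSUS))      # a perfect donor
print(match_fraction("GCTTGA", DONOR_CONSENSUS))      # a poor donor
print(match_fraction("TTTTCAG", ACCEPTOR_CONSENSUS))  # a perfect acceptor
```

A plain match count treats every position equally. A position weight matrix built from known introns would be the more usual way to score sites, but it needs training data that is not in this thread.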

Sorry I’m so late in joining this thread. We’re going to take a look to see if this is a possible misassembly. Another possible source of these apparent chimeric ESTs is that the sequenced YAC had a small region deleted. Andy Fire’s group noted in a recent Genome Research paper that when they sequenced nucleosome cores liberated by micrococcal nuclease, some had no match to the elegans genome (nor to any other genome), suggesting there might be small deletions in the sequence (so the genome really isn’t finished). Re-sequencing of elegans using 454 technology is underway. This should give an indication of how widespread these small deletions might be.

John