C. briggsae genome files question

WouterBerg · May 24, 2020, 3:30am

Hello worm community!

During the current epidemic lockdown situation, my project work has switched to bioinformatics and I have been picking it up as I go along. We are doing a genome analysis on C. briggsae, but I have a question about the genome (FASTA) files available for this species (full name ‘briggsae_WS195.FASTA’).
If any worm bioinformatician could have a look at my query, I would be very grateful.

My PI has asked me to work with some scripts he wrote about 12 years prior, and they were designed to pick out info from the latest C. briggsae genome files of the time, which was WS195. The genome data is organized by chromosome, with a total of 12 headers:

Chromosomes I to V, and ‘random’ versions of these (written as ‘>chrI’, ‘>chrI_random’ etc.)
chromosome X (>chrX), no random for this one.
uncategorized, I think (>chrUn)

Now, I am trying to use the same scripts with the latest genome file, WS276 ‘c_briggsae.PRJNA10731.WS276.genomic_softmasked’. This newer file is organized slightly differently, with no ‘chr#_random’ category and no uncategorized sequences. Instead, there are a large number of entries named like this:
‘>cb25’ followed by different codes for each entry.
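For anyone comparing the two layouts, a minimal Python sketch like the one below lists every header in a FASTA file together with its sequence length. The function name `fasta_lengths` is only illustrative, and it assumes a plain (uncompressed) FASTA file in which the sequence name is the first word after `>`.

```python
def fasta_lengths(lines):
    """Return {sequence_name: length} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence lines are added to the most recent header.
            lengths[name] += len(line)
    return lengths


def summarize_fasta(path):
    """Print one 'name<TAB>length' line per sequence in the file at `path`."""
    with open(path) as handle:
        for name, length in fasta_lengths(handle).items():
            print(f"{name}\t{length}")
```

Running `summarize_fasta` on both the old and the new file should make the difference in top-level sequences easy to see side by side.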

Now, my question comes in two parts.
First off, does anyone know why the files have been reorganized over the years?
Secondly, what do the ‘cb25…’ codes mean? What does this tell me about the genome data listed in those categories?

I am enjoying getting into the bioinformatics side of things a bit more, but this is a little problem I have not been able to solve for myself. Any help would be greatly appreciated.

mh6 · May 24, 2020, 11:36am

Hia,

Until WS254 we had a draft version of the CB4 C. briggsae assembly. In WS254 we updated it to the official one, which is also in INSDC.
You can find a description of what changed in the WS254 release notes found here.

Also have a stroll around the species pages themselves, like Directory->C.briggsae->Genome Assemblies, as they show the releases quite well.

C. briggsae assembly - modified presentation

Summary: We have modified our presentation of C. briggsae genome assembly to be consistent with the INSDC.

Details: In previous releases, the C. briggsae reference genome in WormBase was organised into the following sets of top-level sequences:

I, II, III, IV, V, X : The six nuclear chromosomes. The sequence of each of these comprises a list of oriented, ordered supercontigs.
I_random, III_random, IV_random, V_random, X_random : a random linearization of supercontigs that have been associated with a chromosome, but unlocalised / oriented within that chromosome.
un : a random linearisation of all remaining supercontigs that have not been associated with a chromosome.

The International Nucleotide Sequence Database Collaboration do not support the artificial linearization of unlocalised / unplaced sequences. We have therefore changed our representation of the assembly to be consistent with the INSDC assembly “CB4” (http://www.ncbi.nlm.nih.gov/assembly/GCA_000004555.3). The top-level sequences are now:

I, II, III, IV, V, X : the six nuclear chromosomes (exactly as before);
cb25.* : the remaining 361 unlocalised/unplaced scaffolds.

We refer to this new presentation as “CB4” (to be consistent with INSDC), with the previous representation now being referred to as “CB4linearized”. Should users need to convert annotation files between the two representations, we have provided two liftOver chain files:

ftp://ftp.ebi.ac.uk/pub/databases/wormbase/assemblies/c_briggsae/liftOver/CB4ToCB4linearized.over.chain.gz - for converting coordinates from the new representation to the old one;
ftp://ftp.ebi.ac.uk/pub/databases/wormbase/assemblies/c_briggsae/liftOver/CB4linearizedToCB4.over.chain.gz - for converting coordinates from the old representation to the new one.
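Chain files like these are normally applied with the UCSC liftOver utility, but the format is simple enough to read directly. The following is a minimal Python sketch, not the WormBase or UCSC implementation. It assumes the standard UCSC chain layout (a header line `chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id` followed by `size dt dq` block lines), 0-based coordinates, and a source strand that is always `+`. The function names and the sequence names in the usage note are made up for illustration.

```python
from collections import defaultdict


def parse_chain(lines):
    """Collect ungapped blocks per source sequence from chain-format lines."""
    blocks = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line between chains
        if fields[0] == "chain":
            # Header: remember names and reset the running positions.
            t_name, t_pos = fields[2], int(fields[5])
            q_name, q_size = fields[7], int(fields[8])
            q_strand, q_pos = fields[9], int(fields[10])
            continue
        size = int(fields[0])
        blocks[t_name].append((t_pos, t_pos + size, q_name, q_pos, q_strand, q_size))
        t_pos += size
        q_pos += size
        if len(fields) == 3:
            # Skip the unaligned gap on each side before the next block.
            t_pos += int(fields[1])
            q_pos += int(fields[2])
    return blocks


def lift_position(blocks, seq_name, pos):
    """Map a 0-based position; return (name, position, strand) or None."""
    for t_start, t_end, q_name, q_start, q_strand, q_size in blocks.get(seq_name, []):
        if t_start <= pos < t_end:
            q = q_start + (pos - t_start)
            if q_strand == "-":
                # Minus-strand coordinates count from the far end of the target.
                q = q_size - 1 - q
            return q_name, q, q_strand
    return None  # position falls in a gap or on an unknown sequence
```

For real use, the `.chain.gz` file can be opened with `gzip.open(path, "rt")` and passed straight to `parse_chain`. GFF3 coordinates are 1-based, so subtract 1 before calling `lift_position` and add 1 to the result. Whole features need both ends lifted and checked to land on the same target sequence.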

WouterBerg · May 25, 2020, 1:36am

Hello mh6,

Thanks for the reply! You were very detailed, and it helped me to understand how and why the change happened within the organization of the genome files.

I have downloaded and opened the liftover chain files. They look fairly straightforward, and I assume there are already tools made by other programmers to apply them. However, I googled around a little, but am only able to find a website that performs a liftover process on human genome files, and another programme which seems more broadly applicable, but is written in Python.

At the moment I only know Perl (and a little bit of C). I had high hopes for Bioperl, but can’t find any liftover modules (although maybe they call it a different name?).
Would you be able to direct me to someplace on the web that has a Liftover application that could make use of the .chain files?

Cheers

WouterBerg · June 22, 2020, 9:36pm

I worked out a custom liftover, and it worked out quite well.

However, I am now a few steps later in the pipeline, and I have come across some stuff I don’t quite understand and would like to understand better. I already had a look around online, and even though there are some extensive summaries and descriptions available for the GFF3 file notations (like this one https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md), I still have some questions. Can anyone offer some help?

I am extracting entries from the latest GFF3 file, sorted into separate files by mRNA, exon and a few other categories. I have two questions. I will include a picture of what I’m looking at.

https://i.imgur.com/rHOer2W.png

For entries listed as exons in the GFF3 file, 3 sources (2nd column) are listed. Some are from Wormbase, some are ‘Wormbase_transposon’, and some are listed as ‘history’. What does this mean? I am especially curious about the difference between Wormbase and history entries, as they often have the same location, but not always.
In the attributes column (last column), there are CBG numbers for each entry. However, what I don’t understand about these CBG numbers is that some have nothing behind them, some have .1 or .2 behind them.
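To see how the exon entries divide between sources, a short Python sketch such as the following can parse each GFF3 line and tally features by the source column. It follows the generic GFF3 layout (nine tab-separated columns, `key=value` pairs separated by `;` in the last one). The function names are illustrative only, and nothing in it is specific to WormBase files.

```python
from collections import Counter
from urllib.parse import unquote


def parse_gff3_line(line):
    """Return a dict for one GFF3 feature line, or None for other lines."""
    if line.startswith("#") or not line.strip():
        return None  # comment, directive or blank line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None  # not a well-formed feature line
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attributes.
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": cols[0],
        "source": cols[1],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "attributes": attributes,
    }


def count_by_source(lines, feature_type="exon"):
    """Count features of one type per value of the source column."""
    counts = Counter()
    for line in lines:
        record = parse_gff3_line(line)
        if record and record["type"] == feature_type:
            counts[record["source"]] += 1
    return counts
```

Passing an open file handle to `count_by_source` gives a quick overview of how many exons come from each source. The `attributes` dict of each record holds the Parent/ID values, which is where the CBG identifiers and their suffixes appear.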

mh6 · June 26, 2020, 9:23am

The history_exons are from history_cdses, which are CDS models that have been retired, but as people might still use them for experiments, they are in the GFF3 as special history objects. If you look at the IDs you can see bpXXX, where XXX is the WormBase release in which they were retired. If you go to their web page on WormBase you can also find some comments and historic cross references.

The transposon_cds are from curated transposons and are kept separate, as they technically have a different origin than the regular CDSes of an organism; they are also connected to transposon features. I have heard some arguments for treating them as regular CDSes and including them in the proteome set, but as they are quite often viral in origin, they are very different and (in my opinion) should be treated separately (a bit like the mitochondria and pseudogenes).