MySQL data for Wormbase.

Paulie · October 9, 2013, 12:41pm

Hi all,

I am trying to obtain the MySQL data for Wormbase. The most recent document
(last modified on 22 August 2013, at 17:54) that I can find is here:

http://wiki.wormbase.org/index.php/Administration:Installing_WormBase#Installing_Databases_NOT_DONE

Concerning the setup of Wormbase, it has this to say:

=======================
Installing Databases NOT DONE
Primary database (AceDB)
GFF Sequence feature database (MySQL)

I have set up the AceDB system and can now run it as a server, so I would now like to install
the MySQL databases. Also, I’m not sure what the support databases mentioned in
the same section are for.

I looked here

ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS239/species/

and downloaded and briefly looked at some of the files and could find
nothing like MySQL install scripts.

Am I missing something obvious? Could somebody please point me towards
the MySQL data for Wormbase?

TIA and rgs,

Paul…

mh6 · October 9, 2013, 3:03pm

That page is sorely outdated. It is from before the whole web frontend was rewritten.

The MySQL databases were used mainly for GBrowse, as there was/is a step where the GFF files from the FTP site are loaded into it.
And it (running a GBrowse off the files) should still work, even if you probably have to fiddle with the GBrowse configuration file a bit to show the features you want to see.
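
To get a feel for what is in those GFF files before loading them anywhere, here is a minimal Python sketch that splits one GFF3 line into its nine tab-separated columns. It is not the WormBase or GBrowse loader; the names `Gff3Feature` and `parse_gff3_line` and the sample line are made up for illustration.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    """One feature line of a GFF3 file (hypothetical helper type)."""
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: dict


def parse_gff3_line(line: str) -> Optional[Gff3Feature]:
    """Parse one GFF3 line; return None for blank lines and '#' lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # Column 9 is a ';'-separated list of key=value pairs, URL-escaped.
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[unquote(key)] = unquote(value)
    return Gff3Feature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


# Made-up example line, not taken from a WormBase release.
sample = "I\tWormBase\tgene\t100\t200\t.\t+\t.\tID=Gene:EXAMPLE0001;Name=example-1"
feature = parse_gff3_line(sample)
print(feature.seqid, feature.feature_type, feature.start, feature.end)
```

Counting the values of `feature_type` over a whole file is one quick way to see which feature types a GBrowse configuration would have to cover.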

It might be that there is actually an AWS image somewhere with everything installed, which might make it a bit easier for you (have a look at http://blog.wormbase.org/2011/08/private-instances-of-wormbase-in-10-minutes/)

Paulie · October 9, 2013, 5:04pm

Hi again, and thanks for your input.

From August this year? Compared to the AceDB docco, that’s yesterday

OK… but

Again, OK… are there scripts for these data migrations? The data has to exist somewhere in the Wormbase system and the
only two species I can find mentioned in AceDB are elegans and briggsae.

Well, now I’m a bit lost. Are you telling me that Wormbase is composed of flat files + AceDB?
Or does all the data now go into AceDB? Has a new data storage method been adopted?

Yes, I read this and am not really interested. If all I wanted to do was run Wormbase, that would be
fine, but I want to look at the back end.

I’m also an impoverished student :-\ and don’t want to be squeezing my all-too-fragile Visa.

The data for the different species must be somewhere - I would like to download it and then
gradually set up Wormbase for myself.

If the data is stored in MySQL, I would like to have the scripts which convert the files on
ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS239/species/
to a series of database scripts which I will quite happily take the time to run.

If there has been some radical change in the underlying way Wormbase works, fine. I would
like to get the data into the format that Wormbase uses to store it. Has Wormbase moved
to CouchDB - there’s stuff on github about Ace2CouchDB? Fine - I would appreciate the
scripts for that. Having just launched AceDB - took forever, College busy now - I can
only find reference to the two species mentioned above.

Any further input appreciated.

Rgs,

Paul…

Paulie · October 9, 2013, 8:28pm

Update - I ran xace and on the first window I clicked on Sequences, above where the main
window says (one above the other) “Sequence”, “Genome_Sequence”, “c_DNA_Sequence”,
and at random chose AA007715 - it was a sequence from Brugia malayi and
no other species as far as I can tell.

Am I to therefore assume that all the species data has now been consolidated into
AceDB? This kind of makes sense - there’s approx 35GB of data - for Flybase it’s
55GB.

Could anyone confirm this - although some species appear to be still missing in
the “Classes selector”? This is all v. confusing.

Thanks for any input.

Paul…

mh6 · October 10, 2013, 3:15pm

Hi Paulie,

the acedb database is the “source of all things wormbase” or at least the place things end up (excluding our non-core worms, which live only as GFF3+fasta files), so it contains all species.

As it is object oriented, if you click on “Sequences” it will show all sequence objects (genomic, ESTs, etc) which should be potentially a few million objects.
I would propose as an entry point something like: pick a gene or CDS id from the website and have a look at it … then you can see how the CDS and Transcripts (and other things) show up and how to traverse the object tree.

A lot of objects also have a Species tag/attribute, so you can do a query like this:
open a keyset of all genes and then type into the search box:
Species = "Brugia malayi"

and it should show you all current and former genes of that species
then type into the search box of that result
Live

and it will return all current genes

With the simple query interface you can follow linked objects with FOLLOW, move right with AND NEXT, and COUNT gives you the number of branches attached.
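
The two-step narrowing described above (first by species, then by Live) can be pictured with a toy Python model. This is only an in-memory stand-in for a keyset, not the AceDB API; the gene IDs and the `narrow` helper are made up for illustration.

```python
# Toy in-memory stand-in for an AceDB keyset; NOT the AceDB API.
# All gene IDs and tag values below are invented for illustration.
genes = [
    {"id": "gene_1", "Species": "Brugia malayi", "Live": True},
    {"id": "gene_2", "Species": "Brugia malayi", "Live": False},
    {"id": "gene_3", "Species": "Caenorhabditis elegans", "Live": True},
]


def narrow(keyset, predicate):
    """Return the sub-keyset of objects for which predicate is true."""
    return [obj for obj in keyset if predicate(obj)]


# Step 1: all current and former genes of one species.
by_species = narrow(genes, lambda g: g.get("Species") == "Brugia malayi")

# Step 2: within that result, keep only the objects tagged Live.
live_only = narrow(by_species, lambda g: g.get("Live", False))

print([g["id"] for g in by_species])  # ['gene_1', 'gene_2']
print([g["id"] for g in live_only])   # ['gene_1']
```

Each query runs against the result of the previous one, which is why the second step only needs the single word Live.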

Paulie · October 10, 2013, 7:38pm

Hi again, and I really appreciate your efforts to help me.

This will be my last post on this for a while - I’m going to spend some time familiarising myself
with AceDB and how to use it. I will also read thoroughly any and all docco (User, Admin
and Dev) I can find on the site - maybe I’m guilty of wanting things to “drop into my lap” too much?

There is one thing worrying me though: when I kick off
the saceserver - it fires up OK, but issues this warning “sh: 1: uuencode: not found”, and
I would like to know if this could be a problem? Is it related to GNU sharutils? I can
request that that be installed here. I would appreciate knowing if this is normal or
have I disabled some functionality? xace appears to work without “burping”.
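
The message itself is the shell reporting that it cannot find a `uuencode` executable on the PATH, and `uuencode` is one of the tools shipped in GNU sharutils. A minimal Python sketch to check whether it is visible on a given machine follows; the `find_tool` helper is made up for illustration, and the check says nothing about which saceserver features depend on it.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


path = find_tool("uuencode")
if path is None:
    print("uuencode not found on PATH (it is part of GNU sharutils)")
else:
    print(f"uuencode found at {path}")
```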

OK - I’m starting to grasp this (sort of…). When you say “non-core” - is that species
other than the ones here? ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS240/species/
(New release today - redownload, unzipping… yaaay!). Or is there a defined set of
core species - if so, what is it and/or where is it?

Do I have to load the GFF3s and fasta myself? I don’t mind doing this, if I could
know which data I’m potentially duplicating. Into what do I have to load the non-core
data if I wish to use (for example) the GBrowse, Synteny and other tools to look at it?

My interest is in long range gene and chromosome organisation (intra- and inter-species)
and my fundamental motive for all these pesky questions is how can I be sure that I’m looking
at all of the available data - and finding out what data is available where?

What I would like is to be explicitly told: “Alright Paul, the data for elegans, briggsae, remanei,
japonica and suum (or whatever) is in AceDB and the rest is here (i.e. GFF3 and fasta) and we integrate
the two datasets by doing step1, step2… stepn”.

I’m assuming that the elegans and other core species data is duplicated in their respective species data
in the species download section - as well as in the AceDB downloads.

How does WormBase now interact with non-core species data? Does it? Any links or
whatever to an explanation of this would be appreciated.

Thanks again,

Paul…

p.s. really like the forum software - I’ve used a few of these (notably Oracle forums and orafaq) and
this one is leader of the pack - what is it? I did Google but couldn’t find it.

tharris · October 11, 2013, 3:59am

Hi Paulie -

I’ve only read snippets of this thread but really want to caution you. We don’t really support local installations. Why? WormBase is becoming more and more a cloud-based and highly distributed resource. Things aren’t easily collected on a single box, and even where that is possible, it isn’t really very performant in that configuration.

If you give me some hint as to what you are trying to do, I could probably point you in a direction that is going to be much better than trying to replicate our installation. In fact, our documentation has moved to GoogleDocs and is maintained internally, so even the docs you are working from on the wiki aren’t entirely up-to-date.

The forum software, BTW, is called SMF.

Paulie · October 13, 2013, 6:18am

Hi Todd, and thanks for getting back to me.

In the first instance (pardon the pun) performance is not an issue.

As a biologist, I’m interested in short and long range gene/gene-cluster/chromosomal
organisation and in novel data visualisation methodologies (which is largely why I’m
interested in the Wormbase raw data). I’m also very interested in the effort to attempt
to coordinate the massive accumulation of biological data which, albeit welcome, is causing
problems.

Being a former DBA (mostly Oracle, but also MS SQL Server and others), I have an interest
in HA/HP (esp. clustering and sharding), logging and monitoring - what you cannot measure,
you cannot manage (I’ve seen this problem many times in industry).

As a former programmer, I am interested in F/LOSS offerings and standards based computing
infrastructure - again I’ve seen the chaos which deviation from these causes. Try finding a
solution for your company’s “bespoke”/“sector-specific”/“unique” dev environment at 21:00
(no Google - it’s not public, nobody else around…).

I am fortunate to have a sympathetic project supervisor who tells me that College will
make space available to me. Having returned to study after working, my means are,
ahem…, limited. I would prefer not to have to use Amazon, but rather run the Wormbase
system here.

Google Docs. I went to the Google Accounts sign-in page and was invited
to put in my WormBase user details - I used my handle “Paulie” and my forums
password - no joy! I then went to the WormBase.org site - but it won’t allow me to
register - it merely suggests that I login with FaceBook or Google ID. I logged in with
the address (a tcd.ie Google service) I use here, but I can’t get into the docs section.
How can I gain (read only) access to these docs? If I do have to run an Amazon instance,
I would like to have read any docs thoroughly before starting.

Thanks again for your input.

Paul…
