How to identify or predict the promoter sequence of a gene in C. elegans?

David_S · August 1, 2015, 9:34pm

I am new to molecular biology and C. elegans research and am trying to learn as much as I can. I was reading a couple of papers today and wondered how to identify or predict the promoter sequence of a gene in C. elegans.

For example, the sequence between unc-119 and its upstream gene com-1 is 811 bp, but I saw that people used 2.6 kb upstream of unc-119 as its promoter sequence, part of which lies within the com-1 ORF.

Another example: the sequence between ocr-4 and its upstream gene Y40C5A.3 is 828 bp, but kyEx581[ocr-4::GFP + lin-15(+)] used 4.8 kb of upstream sequence as its promoter, plus the first 68 aa of the ORF (the full ORF is 756 aa). (Copied from the paper: For the ocr-4::gfp fusion, 4.8 kb of sequence upstream of the ocr-4 gene as well as DNA encoding the first 68 amino acids of the predicted OCR-4 protein were amplified by PCR and subcloned into the pPD95.75 vector in frame with GFP.)

I hope someone here can instruct me:

1) How do I identify the correct length of promoter sequence for a gene?
2) In the ocr-4p case, why were 68 aa of the ORF included in the promoter sequence?

Thank you.

HillelSchwartz · August 1, 2015, 11:59pm

I can’t say much about specific instances, like why exactly 68 aa of ocr-4, but the usual ideas are along the following lines:

1) how to identify the correct length of promoter sequence of a gene?
a) Up to the next upstream gene (i.e. coding sequence), when you think this is convenient, i.e. it’s neither too short nor too long
b) If that’s really, really too short, and they’re on the same strand, maybe it’s an operon. Plan accordingly!
c) If it’s not incredibly short, but it’s short: what’s too short? Good question! Sadly, there’s not a great answer. Most people want at least a couple of kb, especially if by including some of the upstream coding sequence you’re also including large introns, especially if those contain sequence conserved in other Caenorhabditis species.
d) On the other hand, “too long” also happens - a few kb is convenient to PCR, more than 7-10 kb really isn’t.
e) Basically, most people find they’re happy with a few kb because they get expression, and in a pattern that makes sense to them. It helps if you’ve got a loss of function (or overexpression) phenotype and can rescue (or reproduce) it by sticking your promoter construct onto a cDNA of your gene.
f) There are some other considerations, such as that there might be important transcriptional control elements in introns within your gene, downstream of the start of transcription - or even completely 3’ of your gene, or farther 5’. These things happen, though they don’t seem to be too terribly common, or at least not commonly to be terribly important. If you can find these elements, or if you have a good idea where you are, you can add them, often by moving them 5’ of your current promoter construct. This is why having some sort of assay for successful rescue can be very important - so you have some confidence your expression pattern is good, and so you know whether you’re in one of those cases where you need more than a few kb 5’ of the start of transcription. If your genomic construct rescues and your cDNA expression construct doesn’t, it’s possible you need to add back noncoding pieces from your genomic construct.
g) On the other hand, there are slightly fancier approaches: you can place the gfp open reading frame into the context of an extrachromosomal genomic fragment (fosmid, cosmid, or cloned PCR'd genomic fragment), either as a fusion with your open reading frame or replacing it (if the latter, avoid inducing nonsense-mediated decay). Or in these days of CRISPR you can knock gfp into the actual genomic locus itself, and in theory have all the native control elements (except maybe any intronic elements you replace, or theoretically even functional elements in the coding sequence). These approaches are considerably more work; you may find them more satisfying.
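As a rough illustration only, points a) to d) can be written down as a small decision rule. This is a sketch, not an established method: the function name and all three thresholds are assumptions for illustration. In particular, the thread gives no number for "really, really too short", so `operon_max_gap_bp` is a placeholder, and 2 kb and 7 kb are just one reading of "a couple of kb" and "more than 7-10 kb".

```python
def suggest_upstream_length(intergenic_bp, same_strand,
                            operon_max_gap_bp=200,  # assumption: not stated in the thread
                            min_promoter_bp=2000,   # assumption: "a couple of kb"
                            max_pcr_bp=7000):       # assumption: low end of "7-10 kb"
    """Return (suggested_length_bp, note) for a promoter::reporter construct.

    intergenic_bp: distance from the ATG to the next upstream coding sequence.
    same_strand:   True if the upstream gene is on the same strand.
    """
    if intergenic_bp < 0:
        raise ValueError("intergenic_bp must be non-negative")
    # b) very short gap on the same strand: consider an operon first
    if same_strand and intergenic_bp <= operon_max_gap_bp:
        return intergenic_bp, "possible operon: check before designing a promoter"
    # c) short gap: extend into the upstream gene to reach a couple of kb
    if intergenic_bp < min_promoter_bp:
        return min_promoter_bp, "short gap: extend into upstream gene"
    # d) long gap: cap at a length that is still convenient to PCR
    if intergenic_bp > max_pcr_bp:
        return max_pcr_bp, "long gap: capped for PCR convenience"
    # a) otherwise take everything up to the next upstream gene
    return intergenic_bp, "use the full intergenic region"


# Example with the 811 bp gap quoted for unc-119; the strand orientation
# is set arbitrarily here, not taken from an annotation.
print(suggest_upstream_length(811, same_strand=False))
```

Whatever length comes out, the rescue assay described in e) and f) is still the real test of whether the construct is long enough.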

2) why 68aa ORF was included in the promoter sequence?
To promote good expression of your expression construct, and maybe even some native regulation, keeping not only the initiator ATG but also the first several amino acids of coding sequences isn’t a bad idea. Why they kept almost the whole first exon in this particular instance, I don’t know - it’s considerably more than I’d normally expect (and I haven’t looked at the paper). If you’ve a particular interest in this construct, you should contact the first author or the corresponding author directly.

David_S · August 4, 2015, 5:04pm

Hi Hillel, thank you very much for your instruction! It is very helpful to me.

Both examples, unc-119 and ocr-4, happen to have very long upstream genes. What about genes whose upstream genes are short? A longer promoter sequence for gene A will then contain its entire upstream gene B and all or part of gene B's promoter sequence.

In this case, will expression of the upstream gene B affect the function of the target gene A?

Should I design a short promoter sequence for gene A? How short should it be?

Thank you very much for your help.

HillelSchwartz · August 6, 2015, 2:30am

I don’t have a great answer to this question. I’m not sure that expression of the upstream gene B will affect transcription of the targeted gene A, but I suppose it’s possible. I guess if you’re concerned about this, the answer might be to include as much of the regulation of gene B as possible, in the hopes that any regulation of the expression of A by the expression of B will more faithfully reflect what happens in vivo.

Also: if your interest is just in a gfp reporter for gene A you’re probably OK, but if you’re actually looking at gene A function (a rescuing GFP fusion protein, for example) you should introduce a mutation to eliminate gene B function from your transgene - a stop codon or frameshift in its open reading frame, for example.

David_S · August 6, 2015, 5:48am

Great idea, Hillel. Thank you so much! I really appreciate your help!

devdude · August 8, 2015, 3:55pm

Sorry I’m a little late to the party here, but let me chime in my two cents.

The reason it’s been so hard to define promoter regions for C. elegans is trans-splicing, which has made it difficult to find the actual transcription start site (TSS) for genes. As Hillel stated, the field used the ATG start and went upstream from there. With defined TSSs, it would be easier to know how much upstream sequence one should take. For example, if your TSS was indeed right near the ATG start, then you could probably make do with 500 bp-1 kb. If the TSS is 5 kb upstream, then you should have at least 5.5-6 kb of sequence. How do you find the TSS? Luckily, three worm groups have published lists that you can access. The list with the best-defined but fewest TSSs is from Barbara Meyer’s group. Andy Fire’s and Julie Ahringer’s groups have more TSSs, but they are not as well defined. Using these datasets you can predict your promoter more precisely. Also, if you already have an idea of a transcription factor that regulates your gene of interest, and modENCODE has a ChIP-Seq dataset for that transcription factor, you could look for its peaks along with the TSSs and make sure you include all the necessary regulatory information.
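The rule of thumb above amounts to adding a 500 bp to 1 kb margin to the distance between the ATG and the TSS. A minimal sketch of that arithmetic follows; the function name and the `margin_bp` parameter are made up for illustration, and the margin simply restates the numbers in the paragraph above.

```python
def upstream_length_from_tss(tss_offset_bp, margin_bp=(500, 1000)):
    """Return (min_bp, max_bp) of upstream sequence to take, measured from the ATG.

    tss_offset_bp: distance in bp from the ATG upstream to the annotated TSS
                   (0 if the TSS sits right at the ATG).
    margin_bp:     extra sequence to include beyond the TSS (low, high).
    """
    if tss_offset_bp < 0:
        raise ValueError("tss_offset_bp must be non-negative")
    low, high = margin_bp
    return tss_offset_bp + low, tss_offset_bp + high


# TSS right at the ATG, and TSS 5 kb upstream of it
print(upstream_length_from_tss(0))     # (500, 1000)
print(upstream_length_from_tss(5000))  # (5500, 6000)
```

This only covers the core promoter plus a margin; more upstream sequence may still be needed to capture distal regulatory elements and reproduce the full expression pattern.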

I took the liberty of looking for TSSs for your genes of interest. For unc-119 the TSS is right around exon 1, maybe 50 nucleotides +/- from the ATG start. For ocr-4 it’s just upstream of the defined 5’UTR.

One more point - because the genome is quite compact, there are often regulatory elements within the first intron, especially if it’s the longest intron in the gene. Many researchers will include sequence up to the 2nd exon as promoter sequence but often do not mutate the ATG start, so some amino acids from your gene of interest end up attached to the reporter.

I hope this helps!

Steve V

PS If you want to make a GFP fusion and make sure all the regulatory elements are there, use CRISPR to recombine GFP into the endogenous locus. If you have an exogenous promoter fusion already made, you could compare its pattern with the one from CRISPR, and that way know whether you have all the necessary regulatory elements.

David_S · August 9, 2015, 4:50pm

Hi Steve, many thanks for your input. It is very helpful! I know I have a lot to learn and thanks for pointing me in the right direction.

LaDu · August 10, 2015, 7:01pm

I’m backing Steve on this one. Very often when researchers refer to a “promoter” for a worm gene they are not necessarily referring to the core promoter. The core promoter for a gene is usually strongly associated with the transcriptional start site. For many years the transcriptional start site for most C. elegans genes was obscured by trans-splicing, although some researchers recently were able to get around that and sequence many TSSs: http://wormtss.utgenome.org/browser/

If you’re just interested in recapitulating the expression pattern of a gene with some kind of promoter::reporter fusion, it may be better to get more upstream DNA than just the region around the TSS.

HillelSchwartz · August 10, 2015, 11:17pm

Very often when researchers refer to a "promoter" for a worm gene they are not necessarily referring to the core promoter

Oh, to be sure. I suppose it might be different if you delve into literature specifically aimed at transcriptional control, but otherwise essentially all references to "promoters" in the C. elegans literature don't mean what transcription people mean when they talk about "promoters"; they mean a combination of the core promoter and enough upstream transcriptional control elements to get an expression pattern they think is right for the gene.