Getting the regulatory regions for a ton of genes

Hi all! I’ve completed an RNA-seq study and have a list of differentially expressed genes.

How does someone:

  1. get the upstream 2Kb or so of a large list of genes, or even better
  2. the promoter region upstream of the ATG excluding any possible upstream ORF that might be <2kb away, or even better
  3. the upstream area plus any intronic sequences that also might regulate the expression of the genes.

I remember using “Worm-Mart” to do this a thousand years ago

Thanks!

Kris

1 Like

As far as I know, Wormmart is superseded by Wormmine, it might be possible to get sequences from there, but not sure how.

If you’re familiar with R, you can try a script along these lines:

library(wbData)

# for conversion of gene names to WBID
gids <- wb_load_gene_ids(281)

# list of genes of interest
list_of_gene_names <- c("unc-10","rab-3")
list_of_gene_ids <- s2i(list_of_gene_names,
                        geneIDs = gids,
                        warn_missing = TRUE)



# prepare genome
genome_path_zipped <- wb_get_genome_path(281)
genome_path_unzipped <- tools::file_path_sans_ext(genome_path_zipped)

if(!file.exists(genome_path_unzipped)){
  R.utils::gunzip(genome_path_zipped)
}

genome <- Rsamtools::FaFile(genome_path_unzipped)


# prepare list of genes coordinates
txdb <- wb_load_TxDb(281)
#> Loading required package: GenomicFeatures
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'


granges_with_genes_coords <- genes(txdb)[list_of_gene_ids]


# Extract the sequences!
GenomicFeatures::extractUpstreamSeqs(genome,
                                     genes = granges_with_genes_coords,
                                     width = 2000)
#> DNAStringSet object of length 2:
#>     width seq                                               names               
#> [1]  2000 CATTTTTCAAACTTTGTCCAATT...TATTACCTTAAAAATCAAACTTC X
#> [2]  2000 TGAGAAAAAACCACAGGAAAAGA...ATTTATGCGCGAAACAAAAGCCT II

Created on 2023-02-27 with reprex v2.0.2

Here I use my package wbData to download the gtf and fasta files from Wormbase (you could do it manually), then I use the BioConductor function extractUpstreamSeqs to extract the corresponding sequence.

For point 2, that should be doable using BioConductor tools: you can make a for loop on the genes of interest and use a function like findOverlaps() to find ORFs that intersect with your upstream region.

For point 3:

intronic sequences that also might regulate the expression of the genes

how do you define an intron that might regulate the expression, as opposed to another intron? That’s a biological problem more than a programming one. The best I can think of would be to import ATAC-Seq or ChIP-Seq data, and look for peaks in the introns. That becomes a pretty complicated problem if you’re not already comfortable with bioinformatics.

Thanks so much, this is very detailed and super helpful! I’ll drop a line when we are successful.