Example script: fetching flanking sequences of variations

Description
A simple script for fetching flanking sequences for all variations at WormBase. You can limit the resulting list by type (ie SNP or allele), by kind (substitution, deletion, etc), and by species (currently just elegans and briggsae). It uses the open-access data mining servers at WormBase - no fuss, no muss!

Please see my directions for how to use these scripts before getting started.

Usage:
`
get_variation_flanks.pl -flank [bp] [options]

Options (default values shown in brackets):
-flank Size of the flank in bp (required)
-species [elegans] or briggsae
-variations Comma-seperated list with possible values of:
SNP, Confirmed_SNP, RFLP, Allele, or [all]
-types Comma-seperated list with possible values of:
Insertion, Deletion, Substition, or [all]

`

Example use cases:
`

Fetch 100 bp on each side of all SNPs that are RFLPs in C. elegans.

get_variation_flanks.pl -flank 100 -species elegans -variations SNP,RFLP

Limit to those that are RFLPs and have been confirmed

get_variation_flanks.pl -flank 100 -species elegans -variations Confirmed_SNP

Get all Variations that are Alleles (ie NOT SNPs) and that are substitutions

get_variation_flanks.pl -flank 100 -species elegans -variations Allele -types Substitution
`
Note that this script works by ORing parameters. If any ONE is true than the variation will be dumped into the output. If you’d like to be more specific, create separate lists by using specific parameters for both “-variations” and “-types”.

Output
Output will be generated in a file called variation_flanks-WS[version]-[species].txt in a tab-delimited format that can be loaded in Excel if desired. The columns are:

  • variation ID
  • species
  • reference sequence
  • kind of variation (Allele, RFLP, SNP, Confirmed_SNP, Predicted_SNP)
  • type of variation (Substitution, Insertion, Deletion, etc)
  • sequence of variation, N2
  • sequence of the mutant (if known); for SNPs, the sequence in the polymorphic strain
  • chromosome start
  • chromosome stop
  • brief context, 15bp per side
  • left flank, in bp as specified on the command line
  • right flank, in bp as specified on the command line
  • full context: left flank, variation sequence, right flank

Obtaining the script
You can download this script using the following command from a terminal:
wget http://toddot.net/projects/wormbase/data_mining_examples/get_variation_flanks.pl OR curl -O http://toddot.net/projects/wormbase/data_mining_examples/get_variation_flanks.pl

Script text follows: get_variation_flanks.pl

`
#!/usr/bin/perl

Author: Todd Harris (harris@cshl.edu)

Please contact me directly if you have questions on how to use this script.

Copyright (@) 2006, Cold Spring Harbor Laboratory

Scriptname: get_variation_flanks.pl

30 Oct 2006

use strict;

use Ace;
use Bio::DB::GFF;
use Getopt::Long;

my ($flank,$species,$desired_types,$desired_variations);
GetOptions(
‘flank=i’ => $flank,
‘species=s’ => $species,
‘variations=s’ => $desired_variations,
‘types=s’ => $desired_types);

$flank or die <<USAGE;

Fetch flanking sequence from a variety of different
types of variations.

Usage:
get_variation_flanks.pl -flank [bp] (options)
where -flank is the desired size of flanking sequence
on either side of the variation.

Options (default values shown in brackets):
-flank Size of the flank in bp (required)
-species [elegans] or briggsae
-variations Comma-seperated list with possible values of:
SNP, Confirmed_SNP, RFLP, Allele, or [all]
-types Comma-seperated list with possible values of:
insertion, deletion, substition, or [all]

Thus, to fetch 100 bp on every side of the C. elegans RFLPs:
flanking_seq_around_variation.pl -flank 100 -type rflp on either side of the variation.

Options (default values shown in brackets):
-flank Size of the flank in bp (required)
-species [elegans] or briggsae
-variations Comma-seperated list with possible values of:
SNP, Confirmed_SNP, RFLP, Allele, or [all]
-types Comma-seperated list with possible values of:
insertion, deletion, substition, or [all]

Thus, to fetch 100 bp on every side of the C. elegans RFLPs:
flanking_seq_around_variation.pl -flank 100 -type rflp
USAGE
;

Which type of variations to include (Variation_type in acedb nomenclature)

my @available_variations = (qw/SNP Confirmed_SNP RFLP Allele/);
my @desired_variations =
(!$desired_variations || $desired_variations eq ‘all’)
? @available_variations
: split(/,/,$desired_variations);
my %desired_variations = map { $_ => 1 } @desired_variations;

Available types of variations (Type_of_mutation in Acedb nomenclature)

my @available_types = (qw/Deletion Insertion Substitution/);
my @desired_types =
(!$desired_types || $desired_types eq ‘all’)
? @available_types
: split(/,/,$desired_types);
my %desired_types = map { $_ => 1 } @desired_types;

$species ||= ‘elegans’;

my $ace = Ace->connect(-host => ‘aceserver.cshl.org’,
-port => ‘2005’) or die “Sorry, I was unable to connect to the remote WormBase database…\n”;
my $version = $ace->status->{database}{version};
my $db =
Bio::DB::GFF->new(-dsn => “dbi:mysql:$species:aceserver.cshl.org”,
-user => ‘anonymous’)
|| die “Couldn;t establish a connection to DSN: $!”;

open OUT,">variation_flanks-$version-$species.txt";

Fetch all variations from the current release of acedb

my @variations = $ace->fetch(-query=>qq{find Variation where Species=C*$species});

my $c;
foreach my $variation (@variations) {
$c++;
if ($c % 10 == 0) {
print STDERR "Dumping variation $c of " . scalar @variations . “…”;
print STDERR -t STDOUT && !$ENV{EMACS} ? “\r” : “\n”;
}

# Exclude those that aren't of the correct kind
# This is potentially a list (ie SNP, RFLP, etc). We must
# check them sequentially, assuming that any one match
# implies inclusion.
next unless (variation_meets_inclusion_criteria([$variation->Variation_type],\%desired_variations));    

# Similarly, let's exclude those that aren't of the right type
next unless (variation_meets_inclusion_criteria([$variation->Type_of_mutation],\%desired_types));

# A nice little hash to store info on the current variation
my %variation;

# Fetch a GFF segment and various parameters relating to the variation
$variation{segment} = $db->segment(-name => $variation);
# Since the whole point of this little exercise is to fetch
# flanking sequence, let's ignore those that do not yet
# have physical coordinates.
next unless $variation{segment};

# Fetch the relative coordinates of the variation.
# This is a bit silly since the coordinates are relative
# to the variation itself.  This is really only 
# useful for insertions and deletions.
$variation{start}     = $variation{segment}->start;
$variation{stop}      = $variation{segment}->stop;

$variation{abs_start} = $variation{segment}->abs_start;
$variation{abs_stop}  = $variation{segment}->abs_stop;

$variation{segment}->absolute(1);    
# The reference sequence (ie: chromosome)
$variation{refseq}   = $variation{segment}->refseq;

# The sequence of the variation
$variation{sequence} = $variation{segment}->dna;

# And finally, some extra information about the variation
$variation{kind}     = join('; ',$variation->Variation_type);
$variation{type}     = join('; ',$variation->Type_of_mutation);
$variation{species}  = $variation->Species;  # Rather redundant since we already know the species

# Try to fetch the mutant / polymorphic sequence if available
my ($type,$wt,$mut) = $variation->Type_of_mutation->row;
$variation{sequence_mutant} = $mut;
    
# Now, create a full flanking segment based on the variation
my %flank;
$flank{segment}   = $db->segment( -name   => $variation{refseq},
			      -start  => $variation{abs_start} - $flank,
			      -stop   => $variation{abs_stop}  + $flank); 
$flank{full_context}  = $flank{segment}->dna;

# Uppercase the SNP in the sequence itself
substr($flank{full_context},$flank,$variation{stop}-$variation{start}+1,
   uc($variation{sequence}));

$flank{brief_context} = substr($flank{full_context},$flank - 15,length($variation{sequence}) + 30); 
$flank{left}          = substr($flank{full_context},0,$flank);
$flank{right}         = substr($flank{full_context},$flank + length($variation{sequence}),$flank);

# Generate an Excel-friendly output
print OUT join("\t",
	   $variation,
	   $variation{species},
	   $variation{refseq},
	   $variation{kind},
	   $variation{type},
	   $variation{sequence},
	   $variation{sequence_mutant},
	   $variation{abs_start},
	   $variation{abs_stop},
	   $flank{brief_context},
	   $flank{left},
	   $flank{right},
	   $flank{full_context}),"\n";    

}

Quickly check to see if a variation or type

is in the wanted list

Pass a list of items to test and hash reference

list of things to test against.

sub variation_meets_inclusion_criteria {
my ($to_check,$desired_list) = @_;
my $good_to_go = 0;

foreach (@$to_check) {
if (defined $desired_list->{$_}) {
    $good_to_go++;
}
}
# Make sure that we meet ALL desired criteria
return 1 if ($good_to_go == scalar keys %$desired_list && $good_to_go > 0);
return 0;

}
`

Isn’t there an or missing in front of the die?

Picky, picky. :slight_smile:

I’m using a logical || instead of or. They perform the same. Unless I’m missing another die statement.