[Bioc-sig-seq] Finding Restriction sites and gaps

Mon Aug 25 01:42:09 CEST 2008

In general, I am trying to get a handle on the capabilities of BioStrings 
and related packages.

My particular question is whether there is a slick (and efficient) way to 
construct a single object - along the lines of the XStringsViews class - 
that contains info about restriction sites and N-gaps using functions in 
Biostrings or other bioC packages.

Some background:

I am considering retooling a pipeline I have for processing collections of 
retroviral integration sites, and I am weighing using bioConductor 
packages in the new version.

An early step in this pipeline involves finding the nearest restriction 
site given the direction from which a fragment of DNA was sequenced (which 
depends on the orientation/strand of the retroviral construct) and given a 
genomic location( Chromo, Position ).

Currently this is done by putting together a list of the locations of 
restriction sites for an enzyme (by piping the FASTA to a simple filter 
written in C) importing the list of gaps in the FASTA from UCSC's 
annotation database, then doing lookups (ala findInterval) on the lists 
and returning info on whether the upstream site could be found (i.e. there 
was no intervening gap in the FASTA) and where it was.

I see a way to do this with matchPattern(), but it requires a bunch of 
calls and some cleanup afterwards. Specifically, I would do this something 
like this: I'd load up BSgenome.Hsapiens.UCSC.hg18, then use matchPattern( 
"TTAA", Hsapiens[[ chromo.i ]] ) to find MSEI restriction sites (for 
example), then matchPattern("AN", Hsapiens[[ chromo.i ]] ) to find all the 
gaps that follow an 'A', matchPattern("NA", Hsapiens[[ chromo.i ]] ) to 
find all gaps that are terminated by an 'A', and so on. Then using start() 
on each of the objects created, I can put together a data structure like 
the one I described in the previous paragraph and then use findInterval to 
do lookups.

I'd be interested to know if there is a more _direct_ way to construct the 
list of gaps or a unified list of sites and gaps with an annotation to 
say which is which.

Also, I would be interested to know if there is an efficient way to do the 
lookup directly (for a given Chromo, Position, and Strand return the 
restriction site) without the intermediate step of constructing the list 
of all restriction sites. Typically, runs involve from the low thousands 
to tens of thousands of retroviral integration events (read: genomic 
loci), but with higher throughput sequencing just around the corner, I 
expect this will rise. So, a one-liner that takes a couple of seconds to 
run for each event is impractical.

Can I count on start( my.xstringsviews.object ) being unique and in order?

TIA,

Chuck

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901