[Bioc-sig-seq] Finding Restriction sites and gaps
Charles C. Berry
cberry at tajo.ucsd.edu
Mon Aug 25 01:42:09 CEST 2008
In general, I am trying to get a handle on the capabilities of BioStrings
and related packages.
My particular question is whether there is a slick (and efficient) way to
construct a single object - along the lines of the XStringsViews class -
that contains info about restriction sites and N-gaps using functions in
Biostrings or other bioC packages.
Some background:
I am considering retooling a pipeline I have for processing collections of
retroviral integration sites, and I am weighing using bioConductor
packages in the new version.
An early step in this pipeline involves finding the nearest restriction
site given the direction from which a fragment of DNA was sequenced (which
depends on the orientation/strand of the retroviral construct) and given a
genomic location( Chromo, Position ).
Currently this is done by putting together a list of the locations of
restriction sites for an enzyme (by piping the FASTA to a simple filter
written in C) importing the list of gaps in the FASTA from UCSC's
annotation database, then doing lookups (ala findInterval) on the lists
and returning info on whether the upstream site could be found (i.e. there
was no intervening gap in the FASTA) and where it was.
I see a way to do this with matchPattern(), but it requires a bunch of
calls and some cleanup afterwards. Specifically, I would do this something
like this: I'd load up BSgenome.Hsapiens.UCSC.hg18, then use matchPattern(
"TTAA", Hsapiens[[ chromo.i ]] ) to find MSEI restriction sites (for
example), then matchPattern("AN", Hsapiens[[ chromo.i ]] ) to find all the
gaps that follow an 'A', matchPattern("NA", Hsapiens[[ chromo.i ]] ) to
find all gaps that are terminated by an 'A', and so on. Then using start()
on each of the objects created, I can put together a data structure like
the one I described in the previous paragraph and then use findInterval to
do lookups.
I'd be interested to know if there is a more _direct_ way to construct the
list of gaps or a unified list of sites and gaps with an annotation to
say which is which.
Also, I would be interested to know if there is an efficient way to do the
lookup directly (for a given Chromo, Position, and Strand return the
restriction site) without the intermediate step of constructing the list
of all restriction sites. Typically, runs involve from the low thousands
to tens of thousands of retroviral integration events (read: genomic
loci), but with higher throughput sequencing just around the corner, I
expect this will rise. So, a one-liner that takes a couple of seconds to
run for each event is impractical.
Can I count on start( my.xstringsviews.object ) being unique and in order?
TIA,
Chuck
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
More information about the Bioc-sig-sequencing
mailing list