[Bioc-sig-seq] making target from fasta file

Wed Jun 4 02:14:10 CEST 2008

Hi Joseph,

Joseph Dhahbi, P.h.D. wrote:
> 
> Hi
> I downloaded the drosophila RepeatMasker from UCSC GB as a text file 
> which is in fasta format and looks like this:
>> dm3_rmsk_NINJA_I range=chr4:2-434 5'pad=0 3'pad=0 strand=+ 
>> repeatMasking=none
> AATTCGCGTCCGCTTA......
>> dm3_rmsk_NINJA_LTR range=chr4:435-611 5'pad=0 3'pad=0 strand=+ 
>> repeatMasking=none
> TGTCGCGGATC....
>> dm3_rmsk_Baggins1 range=chr4:638-1723 5'pad=0 3'pad=0 strand=- 
>> repeatMasking=none
> ATACGATGG......
> 
> I made the input dictionary and I would like to make the RepeatMasker 
> sequences as the target.  When I used ‘read.DNAStringSet’ it recognized 
> only the first sequence of the fasts file.  Ho do I merge all of the 
> sequences in and make them as a target.

If your file is really FASTA then read.DNAStringSet() should extract all
the records and return a DNAStringSet object where each element corresponds
to a record in the original file. So it seems like you've hit a bug in the
read.DNAStringSet() function. Can you please provide the URL to the file
you downloaded so we can try to reproduce?

Anyway, what you are trying to achieve can be done in an easier (and more
efficient) way. You don't need to download the RepeatMasker sequences for
this; just use the BSgenome.Dmelanogaster.UCSC.dm3 package. The RepeatMasker
information is already included in it as part of the built-in masks provided
for each chromosome:

   > library(BSgenome.Dmelanogaster.UCSC.dm3)
   > Dmelanogaster
   Fly genome
   |
   | organism: Drosophila melanogaster
   | provider: UCSC
   | provider version: dm3
   | release date: Apr. 2006
   | release name: BDGP Release 5
   |
   | single sequences (see '?seqnames'):
   |   chr2L      chr2R      chr3L      chr3R      chr4       chrX       chrU
   |   chrM       chr2LHet   chr2RHet   chr3LHet   chr3RHet   chrXHet    chrYHet
   |   chrUextra
   |
   | multiple sequences (see '?mseqnames'):
   |   upstream1000  upstream2000  upstream5000
   |
   | (use the '$' or '[[' operator to access a given sequence)
   > chr2L <- Dmelanogaster$chr2L
   > chr2L
     23011544-letter "MaskedDNAString" instance (# for masking)
   seq: CGACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTTGATGAACCCCCCTTTCAAA
   masks:
     maskedwidth  maskedratio active                              names
   1         200 8.691290e-06  FALSE                      assembly gaps
   2     1966561 8.545976e-02  FALSE                       RepeatMasker
   3       61603 2.677048e-03  FALSE Tandem Repeats Finder [period<=12]
   all masks together:
     maskedwidth maskedratio
         1988181  0.08639929
   all active masks together:
     maskedwidth maskedratio
               0           0

Note that the built-in masks are always inactive by default. To activate
a mask do:

   > active(masks(chr2L))[2] <- TRUE  # activate the RepeatMasker mask

Now only the parts of chr2L that are NOT repeat regions are visible.
To invert this, use gaps():

   > chr2Lrepeats <- gaps(chr2L)
   > chr2Lrepeats
     23011544-letter "MaskedDNAString" instance (# for masking)
   seq: #GACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTT###################
   masks:
     maskedwidth maskedratio active
   1    21044983   0.9145402   TRUE

Then use matchPDict() (or countPDict()) in the usual way.

The GenomeSearching vignette in the BSgenome package has more
information about masking (some sections are still incomplete but
will be completed soon).

Hope this helps,
H.

> Thank you for your help
> 
> 
> Regards,
> Joseph
> 
> Joseph M. Dhahbi, PhD
> Childrens Hospital Oakland Research Institute
> 5700 Martin Luther King Jr. Way
> Oakland, CA 94609
> USA
> Ph.(510)428-3885 EXT.5743
> Cell.(702)335-0795
> Fax (510)450-7910
> jdhahbi at chori.org
> The email message (and any attachments) is for the sole...{{dropped:3}}
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing