[Bioc-sig-seq] making target from fasta file

Wed Jun 4 02:42:49 CEST 2008

I downloaded RepeatMasker from the Table Browser:
http://genome.ucsc.edu/cgi-bin/hgTables?command=start
I will try your suggestion.
Thank you for your help.

On Tue, 03 Jun 2008 17:14:10 -0700
  Herve Pages <hpages at fhcrc.org> wrote:
> Hi Joseph,
> 
> Joseph Dhahbi, P.h.D. wrote:
>> 
>> Hi
>> I downloaded the Drosophila RepeatMasker track from the UCSC Genome Browser
>> as a text file, which is in FASTA format and looks like this:
>> >dm3_rmsk_NINJA_I range=chr4:2-434 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>> AATTCGCGTCCGCTTA......
>> >dm3_rmsk_NINJA_LTR range=chr4:435-611 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>> TGTCGCGGATC....
>> >dm3_rmsk_Baggins1 range=chr4:638-1723 5'pad=0 3'pad=0 strand=- repeatMasking=none
>> ATACGATGG......
>> 
>> I made the input dictionary and I would like to use the RepeatMasker
>> sequences as the target. When I used 'read.DNAStringSet' it recognized
>> only the first sequence of the FASTA file. How do I merge all of the
>> sequences and use them as a target?
> 
> If your file is really FASTA then read.DNAStringSet() should extract all
> the records and return a DNAStringSet object where each element corresponds
> to a record in the original file. So it seems like you've hit a bug in the
> read.DNAStringSet() function. Can you please provide the URL to the file
> you downloaded so we can try to reproduce?
> 
> Anyway, what you are trying to achieve can be done in an easier (and more
> efficient) way. You don't need to download the RepeatMasker sequences for
> this; just use the BSgenome.Dmelanogaster.UCSC.dm3 package. The RepeatMasker
> information is already included in it as part of the built-in masks provided
> for each chromosome:
> 
>   > library(BSgenome.Dmelanogaster.UCSC.dm3)
>   > Dmelanogaster
>   Fly genome
>   |
>   | organism: Drosophila melanogaster
>   | provider: UCSC
>   | provider version: dm3
>   | release date: Apr. 2006
>   | release name: BDGP Release 5
>   |
>   | single sequences (see '?seqnames'):
>   |   chr2L      chr2R      chr3L      chr3R      chr4       chrX       chrU
>   |   chrM       chr2LHet   chr2RHet   chr3LHet   chr3RHet   chrXHet    chrYHet
>   |   chrUextra
>   |
>   | multiple sequences (see '?mseqnames'):
>   |   upstream1000  upstream2000  upstream5000
>   |
>   | (use the '$' or '[[' operator to access a given sequence)
>   > chr2L <- Dmelanogaster$chr2L
>   > chr2L
>     23011544-letter "MaskedDNAString" instance (# for masking)
>   seq: CGACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTTGATGAACCCCCCTTTCAAA
>   masks:
>     maskedwidth  maskedratio active                              names
>   1         200 8.691290e-06  FALSE                      assembly gaps
>   2     1966561 8.545976e-02  FALSE                       RepeatMasker
>   3       61603 2.677048e-03  FALSE Tandem Repeats Finder [period<=12]
>   all masks together:
>     maskedwidth maskedratio
>         1988181  0.08639929
>   all active masks together:
>     maskedwidth maskedratio
>               0           0
> 
> Note that the built-in masks are always inactive by default. To activate
> a mask do:
> 
>   > active(masks(chr2L))[2] <- TRUE  # activate the RepeatMasker mask
> 
> Now only the parts of chr2L that are NOT repeat regions are visible.
> To invert this, use gaps():
> 
>   > chr2Lrepeats <- gaps(chr2L)
>   > chr2Lrepeats
>     23011544-letter "MaskedDNAString" instance (# for masking)
>   seq: #GACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTT###################
>   masks:
>     maskedwidth maskedratio active
>   1    21044983   0.9145402   TRUE
> 
> Then use matchPDict() (or countPDict()) in the usual way.
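
To illustrate that last step, here is a minimal sketch on a toy sequence rather than on chr2L. It assumes a Biostrings installation that provides Mask(), PDict() and countPDict(); the subject sequence, the mask coordinates and the two-pattern dictionary are all invented for the example.

```r
library(Biostrings)

# Toy subject: pretend positions 4-9 ("CCCGGG") are a repeat region
# and cover them with a single (active) mask.
subject <- DNAString("AAACCCGGGTTTAAACCC")
masks(subject) <- Mask(mask.width = length(subject), start = 4, end = 9)

# Constant-width dictionary of patterns, preprocessed once.
dict <- DNAStringSet(c("CCGG", "AAAC"))
pdict <- PDict(dict)

# With the mask active, only the non-repeat parts are searched.
counts_outside <- countPDict(pdict, subject)

# gaps() inverts the mask, so only the "repeat" region is searched.
repeats_only <- gaps(subject)
counts_inside <- countPDict(pdict, repeats_only)

print(counts_outside)
print(counts_inside)
```

The same pattern should carry over to the real data by replacing the toy subject with chr2Lrepeats and the toy dictionary with the input dictionary, using matchPDict() instead of countPDict() when the match positions are needed.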
> 
> The GenomeSearching vignette in the BSgenome package has more
> information about masking (some sections are still incomplete but
> will be completed soon).
> 
> Hope this helps,
> H.
> 
> 
>> Thank you for your help
>> 
>> 
>> Regards,
>> Joseph
>> 
>> Joseph M. Dhahbi, PhD
>> Childrens Hospital Oakland Research Institute
>> 5700 Martin Luther King Jr. Way
>> Oakland, CA 94609
>> USA
>> Ph.(510)428-3885 EXT.5743
>> Cell.(702)335-0795
>> Fax (510)450-7910
>> jdhahbi at chori.org
>> 
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 

Regards,
Joseph
