[Bioc-sig-seq] making target from fasta file
Joseph Dhahbi, P.h.D.
jdhahbi at chori.org
Wed Jun 4 02:42:49 CEST 2008
I downloaded RepeatMasker from the Table Browser:
http://genome.ucsc.edu/cgi-bin/hgTables?command=start
I will try your suggestion
Thank you for your help
On Tue, 03 Jun 2008 17:14:10 -0700
Herve Pages <hpages at fhcrc.org> wrote:
> Hi Joseph,
>
> Joseph Dhahbi, P.h.D. wrote:
>>
>> Hi
>> I downloaded the drosophila RepeatMasker from UCSC GB as
>>a text file
>> which is in fasta format and looks like this:
>>> dm3_rmsk_NINJA_I range=chr4:2-434 5'pad=0 3'pad=0
>>>strand=+
>>> repeatMasking=none
>> AATTCGCGTCCGCTTA......
>>> dm3_rmsk_NINJA_LTR range=chr4:435-611 5'pad=0 3'pad=0
>>>strand=+
>>> repeatMasking=none
>> TGTCGCGGATC....
>>> dm3_rmsk_Baggins1 range=chr4:638-1723 5'pad=0 3'pad=0
>>>strand=-
>>> repeatMasking=none
>> ATACGATGG......
>>
>> I made the input dictionary and I would like to make the
>>RepeatMasker
>> sequences as the target. When I used
>>‘read.DNAStringSet’ it recognized
>> only the first sequence of the fasts file. Ho do I
>>merge all of the
>> sequences in and make them as a target.
>
> If your file is really FASTA then read.DNAStringSet()
>should extract all
> the records and return a DNAStringSet object where each
>element corresponds
> to a record in the original file. So it seems like
>you've hit a bug in the
> read.DNAStringSet() function. Can you please provide the
>URL to the file
> you downloaded so we can try to reproduce?
>
> Anyway, what you are trying to achieve can be done in an
>easier (and more
> efficient) way. You don't need to download the
>RepeatMasker sequences for
> this; just use the BSgenome.Dmelanogaster.UCSC.dm3
>package. The RepeatMasker
> information is already included in it as part of the
>built-in masks provided
> for each chromosome:
>
> > library(BSgenome.Dmelanogaster.UCSC.dm3)
> > Dmelanogaster
> Fly genome
> |
> | organism: Drosophila melanogaster
> | provider: UCSC
> | provider version: dm3
> | release date: Apr. 2006
> | release name: BDGP Release 5
> |
> | single sequences (see '?seqnames'):
> | chr2L chr2R chr3L chr3R chr4
> chrX chrU
> | chrM chr2LHet chr2RHet chr3LHet
> chr3RHet chrXHet chrYHet
> | chrUextra
> |
> | multiple sequences (see '?mseqnames'):
> | upstream1000 upstream2000 upstream5000
> |
> | (use the '$' or '[[' operator to access a given
>sequence)
> > chr2L <- Dmelanogaster$chr2L
> > chr2L
> 23011544-letter "MaskedDNAString" instance (# for
>masking)
> seq:
>CGACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTTGATGAACCCCCCTTTCAAA
> masks:
> maskedwidth maskedratio active
> names
> 1 200 8.691290e-06 FALSE
> assembly gaps
> 2 1966561 8.545976e-02 FALSE
> RepeatMasker
> 3 61603 2.677048e-03 FALSE Tandem Repeats
>Finder [period<=12]
> all masks together:
> maskedwidth maskedratio
> 1988181 0.08639929
> all active masks together:
> maskedwidth maskedratio
> 0 0
>
> Note that the built-in masks are always inactive by
>default. To activate
> a mask do:
>
> > active(masks(chr2L))[2] <- TRUE # activate the
>RepeatMasker mask
>
> Now only the parts of chr2L that are NOT repeat regions
>are visible.
> To invert this, use gaps():
>
> > chr2Lrepeats <- gaps(chr2L)
> > chr2Lrepeats
> 23011544-letter "MaskedDNAString" instance (# for
>masking)
> seq:
>#GACAATGCACGACAGAGGAAGCAGAACAGATATTT...GCATATTTGCAAATTTT###################
> masks:
> maskedwidth maskedratio active
> 1 21044983 0.9145402 TRUE
>
> Then use matchPDict() (or countPDict()) in the usual
>way.
>
> The GenomeSearching vignette in the BSgenome package has
>more
> information about masking (some sections are still
>incomplete but
> will be completed soon).
>
> Hope this helps,
> H.
>
>
>> Thank you for your help
>>
>>
>> Regards,
>> Joseph
>>
>> Joseph M. Dhahbi, PhD
>> Childrens Hospital Oakland Research Institute
>> 5700 Martin Luther King Jr. Way
>> Oakland, CA 94609
>> USA
>> Ph.(510)428-3885 EXT.5743
>> Cell.(702)335-0795
>> Fax (510)450-7910
>> jdhahbi at chori.org
>> The email message (and any attachments) is for the
>>sole...{{dropped:3}}
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
Regards,
Joseph
Joseph M. Dhahbi, PhD
Childrens Hospital Oakland Research Institute
5700 Martin Luther King Jr. Way
Oakland, CA 94609
USA
Ph.(510)428-3885 EXT.5743
Cell.(702)335-0795
Fax (510)450-7910
jdhahbi at chori.org
The email message (and any attachments) is for the sole...{{dropped:3}}
More information about the Bioc-sig-sequencing
mailing list