[Bioc-sig-seq] using subseq() in Biostrings to insert bases

Tue Jul 20 22:45:34 CEST 2010

Hi Andrew,

On 07/19/2010 08:11 PM, Andrew Yee wrote:
> Is there a way to use subseq() to insert nucleotides?
>
> Take for example foo:
>
> foo<- DNAStringSet('ACTTA')
>
> I'd like to insert e.g. a G between positions 2 and 3 so that foo looks like
> 'ACGTTA'  Is there a way to do this using subseq()?  Or is an alternative
> function recommended?

Just do:

   > foo <- DNAString('ACTTA')
   > subseq(foo, start=3, end=2) <- DNAString("G")
   > foo
     6-letter "DNAString" instance
   seq: ACGTTA

Generally speaking, if performance is important, you should get better
results by *not* switching back and forth between DNAString/DNAStringSet
objects and regular character vectors. For example, if you want to
replace the nucleotides starting at pos 101 in Human chrY by those in
chrM:

   o Using subseq<-():

     > library(BSgenome.Hsapiens.UCSC.hg19)
     > chrY <- unmasked(Hsapiens$chrY)
     > chrM <- unmasked(Hsapiens$chrM)
     > system.time(subseq(chrY, start=101, width=length(chrM)) <- chrM)
        user  system elapsed
       0.190   0.010   0.193
     > gc()
                used  (Mb) gc trigger  (Mb) max used  (Mb)
     Ncells  1158959  61.9    1835812  98.1  1394696  74.5
     Vcells 15445470 117.9   17364147 132.5 15568274 118.8

   o Using substr():

     > library(BSgenome.Hsapiens.UCSC.hg19)
     > chrY <- unmasked(Hsapiens$chrY)
     > chrM <- unmasked(Hsapiens$chrM)
     > system.time({tmp <- as.character(chrY);
                    tmp2 <- paste(substr(tmp, start=1, stop=100),
                                  as.character(chrM),
                                  substr(tmp, start=101+length(chrM),
                                              stop=length(chrY)),
                                  sep="");
                    chrY <- DNAString(tmp2)})
        user  system elapsed
       1.860   0.230   2.088
     > gc()
                used  (Mb) gc trigger  (Mb) max used  (Mb)
     Ncells  1128874  60.3    1835812  98.1  1368491  73.1
     Vcells 30276832 231.0   35483245 270.8 30284765 231.1

Using subseq<-() is faster and more memory efficient. It's also
more convenient.

Cheers,
H.

>
> Thanks,
> Andrew
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319