[Bioc-sig-seq] complement and reverseComplement from a character vector

Tue Jun 16 11:52:10 CEST 2009

Dear Hervé,

Sorry for answering so late. Thanks for integrating these methods. I  
agree that making them endomorphic is probably the best and most  
intuitive solution.

I've been playing with Biostrings lately on "de-novo" assembly data  
and one important value that one wants to calculate is the so-called  
"N50" value. It is the size of the contig, which addition to the  
cumulative contig size (ordered by decreasing sizes) makes it bigger  
than half the total size of all contigs. Basically it gives an idea of  
the contig size distribution within the data (although, one still  
needs to know the contig size range). I could not find any function  
doing this, so I have written a function to calculate it and I think  
it could be an interesting addition to the Biostrings package. Here is  
the code:

setGeneric("N50",function(x) standardGeneric("N50"))
setMethod("N50","XStringSet",function(x){
   # the order
   x.order<-order(width(x),decreasing=TRUE)
   width(x[x.order])[which(cumsum(width(x[x.order]))>=sum(width(x)/2)) 
[[1]]]
   })

An example (using the non endomorphic functions...):

my.contig<-DNAStringSet(
	sapply(
		sample(c(100:10000),10),
		function(size){
			paste(sample(c("A","C","G","T"),size,replace=TRUE),collapse="")
	})
)
N50(my.contig)

If you find it valuable, I could contribute the documentation.

Best wishes,

---------------------------------------------------------------
Nicolas Delhomme

High Throughput Functional Genomics Center

European Molecular Biology Laboratory

Tel: +49 6221 387 8426
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

P.S.

 > sessionInfo()
R version 2.9.0 (2009-04-17)
i386-apple-darwin8.11.1

locale:
en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.12.0 IRanges_1.2.0     Biobase_2.4.0

loaded via a namespace (and not attached):
[1] tools_2.9.0

On 2 Jun 2009, at 20:36, Hervé Pagès wrote:

> Hi Nicolas,
>
> Nicolas Delhomme wrote:
>> Hi all,
>> As described in the doc for reverse, complement and  
>> reverseComplement, x can be a character vector.
>> When I tried, it turned out that these functions are not  
>> implemented for complement and reverseComplement.
>> There are of some use for me, so I just wrote them up:
>> setMethod("complement", "character",
>>    function(x, ...){
>>            if(length(x)==1) x<-DNAString(x)
>>        else x<-DNAStringSet(x)
>>        complement(x)
>>    }
>> )
>> setMethod("reverseComplement", "character",
>>    function(x, ...){
>>            if(length(x)==1) x<-DNAString(x)
>>            else x<- DNAStringSet(x)
>>            reverseComplement(x)
>>    }
>> )
>> I just post them in case there would be of use for someone else. I  
>> recognize that it does not save much compared to first converting  
>> the character vector into a DNAString or DNAStringSet. At least,  
>> for me, it allows to skip some "input" evaluation test checking  
>> whether I start with a character vector or a DNAString.
>
> Thanks for the feedback!
>
> The reason these method were not defined is that when the input is  
> character
> it's not clear whether it should be treated as DNA or RNA input.  
> However
> choosing to treat it as DNA is probably what the user will want 99%  
> of the
> time so they could indeed be implemented as in your code above. So  
> if the
> input contains the "U" letter, they will simply fail (without trying  
> to
> be smart).
>
> Note that there is no method for BString/BStringSet objects for  
> exactly the
> same reason.
>
> A question subsists though: should these methods return a DNAString/ 
> DNAStringSet
> object or should the result be turned back into an ordinary  
> character vector
> before it's returned to the user? The latter would make these methods
> "endomorphisms" (i.e. the output has the same type as the input)  
> which is
> more consistent with what the other methods do but I'm not against  
> making
> an exception when the input is character (not a big deal as long as  
> this is
> clearly documented). Then if the user really want this result to be  
> character
> then s/he can always apply as.character() to it.
>
> If there are no objections, I will add these methods to the  
> Biostrings package.
>
> Cheers,
> H.
>
>> Best,
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> High Throughput Functional Genomics Center
>> European Molecular Biology Laboratory
>> Tel: +49 6221 387 8426
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> -- 
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319