[Bioc-sig-seq] Collapse list of DNAStringSet objects into single object

Thu Sep 2 01:22:22 CEST 2010

On 09/01/2010 11:57 AM, Tyler Backman wrote:
> Do you know of a more efficient way to collapse a list of
> DNAStringSet objects into a single DNAStringSet? I'm trying to parse
> an annotated assembly by grabbing the longest contig that hits each
> swissprot gene, where the gene id is in the name of each sequence.
> 
> The way I found that works, but is very slow, is to convert them to a
> list of character strings, and then back to a DNAStringSet:
> 
> longestContigs <- DNAStringSet(sapply(longestContigs, as.character))

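Hi Tyler -- c(dna1, dna2, dna3) concatenates the DNAStringSets dna1,
dna2, dna3 into a single DNAStringSet. If you have a list 'dnas' of
DNAStringSets, then

  do.call(c, dnas)

does the same for every element of the list, without the round trip
through character strings. As a reproducible example: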

  fl <- system.file("extdata", "s_1_sequence.txt", package="Biostrings")
  dna <- read.DNAStringSet(fl, "fastq")
  dnas <- rep(list(dna), 10) ## a list of DNAStringSet objects, so
  do.call(c, dnas)
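
If the example file is not at hand, the same idea can be sketched with a
few small sequences built in memory. The sequences, contig names and gene
names below are made up for illustration, and the unname() call is a
precaution added here (an assumption, not something the call above
requires) so that the list names are not handed to c() as argument names:

```r
library(Biostrings)

## A small named list of DNAStringSet objects, standing in for the
## per-gene results; all names and sequences here are invented.
dnas <- list(
    geneA = DNAStringSet(c(contig1 = "ACGTACGT")),
    geneB = DNAStringSet(c(contig2 = "GGCC", contig3 = "TTTAAA")),
    geneC = DNAStringSet(c(contig4 = "ACG"))
)

## Drop the list names, then concatenate all elements in one call.
collapsed <- do.call(c, unname(dnas))

## The result is a single DNAStringSet; the sequence names are kept.
length(collapsed)
names(collapsed)
width(collapsed)
```

Because the sequence names travel with each DNAStringSet, there should be
no need to save them beforehand and reapply them afterwards.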

Martin

> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-08-27 r52820)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.17.29 IRanges_1.7.32

loaded via a namespace (and not attached):
[1] Biobase_2.9.0

> 
> Here's the full example:
> 
> library(Biostrings)
> contigsWithHits <- read.DNAStringSet("transcripts.fa")
> 
> # extract only swissprot gene names:
> geneNames <- gsub("^Locus_\\d+_Transcript_\\d+/\\d+_Confidence_[0-9.]+_(.+)$",
>                   "\\1", names(contigsWithHits), perl=TRUE)
> 
> # keep longest from each annotation gene group:
> getLongest <- function(contigList){
>     contigWidth <- width(contigList)
>     return(contigList[which.max(contigWidth)])
> }
> 
> # apply getLongest to each group:
> longestContigs <- tapply(contigsWithHits, geneNames, getLongest)
> contigNames <- sapply(longestContigs, names)
> # collapse list of DNAStringSet objects back into a single DNAStringSet
> longestContigs <- DNAStringSet(sapply(longestContigs, as.character))
> # reapply names:
> names(longestContigs) <- contigNames
> 
> Sincerely, Tyler William H Backman
> Cheminformatics Programmer
> Department of Botany and Plant Sciences
> E-mail: tyler.backman at ucr.edu
> 1207E Genomics Building
> University of California
> Riverside, CA 92521
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793