[Bioc-sig-seq] Collapse list of DNAStringSet objects into single object
Martin Morgan
mtmorgan at fhcrc.org
Thu Sep 2 01:22:22 CEST 2010
On 09/01/2010 11:57 AM, Tyler Backman wrote:
> Do you know of a more efficient way to collapse a list of
> DNAStringSet objects into a single DNAStringSet? I'm trying to parse
> an annotated assembly by grabbing the longest contig that hits each
> swissprot gene, where the gene id is in the name of each sequence.
>
> The way I found that works, but is very slow is to convert them to a
> list of character strings, and then back to a DNAStringSet:
>
> longestContigs <- DNAStringSet(sapply(longestContigs, as.character))
Hi Tyler -- c(dna1, dna2, dna3) concatenates the DNAStringSets dna1,
dna2, dna3. If you have a list 'dnas' of DNAStringSets, then
do.call(c, dnas)
and as a reproducible example
fl <- system.file("extdata", "s_1_sequence.txt", package="Biostrings")
dna <- read.DNAStringSet(fl, "fastq")
dnas <- rep(list(dna), 10) ## a list of DNAStringSet objects, so
do.call(c, dnas)
Martin
> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-08-27 r52820)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.17.29 IRanges_1.7.32
loaded via a namespace (and not attached):
[1] Biobase_2.9.0
>
> Here's the full example:
>
> library(Biostrings) contigsWithHits <-
> read.DNAStringSet("transcripts.fa")
>
> # extract only swissprot gene names: geneNames <-
> gsub("^Locus_\\d+_Transcript_\\d+/\\d+_Confidence_[0-9.]+_(.+)$",
> "\\1", names(contigsWithHits), perl=TRUE)
>
> # keep longest from each annotation gene group:
>
> getLongest <- function(contigList){ contigWidth <- width(contigList)
> return(contigList[which.max(contigWidth)]) }
>
> # apply getLongest to each group: longestContigs <-
> tapply(contigsWithHits, geneNames, getLongest) contigNames <-
> sapply(longestContigs, names) # collapse list of DNAStringSet objects
> back into a single DNAStringSet longestContigs <-
> DNAStringSet(sapply(longestContigs, as.character)) # reapply names:
> names(longestContigs) <- contigNames
>
> Sincerely, Tyler William H Backman Cheminformatics Programmer
> Department of Botany and Plant Sciences E-mail:
> tyler.backman at ucr.edu 1207E Genomics Building University of
> California Riverside, CA 92521
>
> _______________________________________________ Bioc-sig-sequencing
> mailing list Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-sig-sequencing
mailing list