[Bioc-sig-seq] Collapse list of DNAStringSet objects into single object
Tyler Backman
tbackman at ucr.edu
Wed Sep 1 20:57:46 CEST 2010
Do you know of a more efficient way to collapse a list of DNAStringSet objects into a single DNAStringSet? I'm trying to parse an annotated assembly by grabbing the longest contig that hits each swissprot gene, where the gene id is in the name of each sequence.
The way I found that works, but is very slow is to convert them to a list of character strings, and then back to a DNAStringSet:
longestContigs <- DNAStringSet(sapply(longestContigs, as.character))
Here's the full example:
library(Biostrings)
contigsWithHits <- read.DNAStringSet("transcripts.fa")
# extract only swissprot gene names:
geneNames <- gsub("^Locus_\\d+_Transcript_\\d+/\\d+_Confidence_[0-9.]+_(.+)$", "\\1", names(contigsWithHits), perl=TRUE)
# keep longest from each annotation gene group:
getLongest <- function(contigList){
contigWidth <- width(contigList)
return(contigList[which.max(contigWidth)])
}
# apply getLongest to each group:
longestContigs <- tapply(contigsWithHits, geneNames, getLongest)
contigNames <- sapply(longestContigs, names)
# collapse list of DNAStringSet objects back into a single DNAStringSet
longestContigs <- DNAStringSet(sapply(longestContigs, as.character))
# reapply names:
names(longestContigs) <- contigNames
Sincerely,
Tyler William H Backman
Cheminformatics Programmer
Department of Botany and Plant Sciences
E-mail: tyler.backman at ucr.edu
1207E Genomics Building
University of California
Riverside, CA 92521
More information about the Bioc-sig-sequencing
mailing list