[BioC] Non ambiguous DNA runs
Harris A. Jaffee
hj at jhu.edu
Sat Apr 14 05:28:07 CEST 2012
This will get you most of the way.
> subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA"
> s = DNAString(subject)
> F = letterFrequencyInSlidingView(s, letters="ACGT", view.width=1)
> Rle(F)
'integer' Rle of length 40 with 6 runs
Lengths: 2 20 1 13 2 2
Values : 0 1 0 1 0 1
On Apr 13, 2012, at 7:17 PM, Marcus Davy wrote:
> Hi,
> I am wanting to find the longest run of non ambiguous DNA sequence (A, C,
> G, T only) from an example DNAString, e.g. sequences from sanger reads.
>
> Is there a simple Biostrings function to do this that I am missing?
>
> An approach I thought of was to use mask() to identify the ambiguous base
> positions, so they (and their converse) can be visualized as a
> XStringViews object, however in extracting the ambiguous base positions
> consensusMatrix() wouldn't work with the baseOnly option if the input was a
> DNAString.
>
>
> ## Ambiguous bases
>
> names(IUPAC_CODE_MAP)[-(1:4)]
>
>
>
> subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA"
>
>
>
> ## Find ambiguous base positions
>
> ## pattern matching outside biostrings
>
> gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]]
>
>
>
> ## A DNAString class fails with baseOnly option
>
> try(which(consensusMatrix(DNAString(subject), baseOnly=TRUE)["other",]==1))
>
>
>
> ## Find ambiguous bases, forced to use DNASetingSet(subject)
>
> ambGaps <- which(consensusMatrix(DNAStringSet(subject),
> baseOnly=TRUE)["other",]==1)
>
>
>
> masked <- mask(subject, start=ambGaps, end=ambGaps)
>
> masked
>
>
> Views on a 40-letter BString subject
>
> subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA
>
> views:
>
> start end width
>
> [1] 3 22 20 [CTTGTAAAAACTTACACGAA]
>
> [2] 24 36 13 [ATCGGCAGAGAAG]
>
> [3] 39 40 2 [CA]
>
>
> ## Return longest masked sequence
>
> ind <- which.max(width(masked))
>
> masked[[ind]]
>
> 20-letter "BString" instance
>
> seq: CTTGTAAAAACTTACACGAA
>
>
> thanks,
>
> Marcus
>
>
>> sessionInfo()
>
> R version 2.15.0 (2012-03-30)
>
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
>
> locale:
>
> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>
>
> attached base packages:
>
> [1] stats graphics grDevices utils datasets methods base
>
>
> other attached packages:
>
> [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0
> BiocInstaller_1.4.3
>
>
> loaded via a namespace (and not attached):
>
> [1] stats4_2.15.0 tools_2.15.0
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list