[Bioc-sig-seq] Quality Value Analysis from a BStringSet

Fri Jun 11 15:55:20 CEST 2010

On 06/11/2010 12:48 PM, Marcus Davy wrote:
> Hi,
> there are a couple of ways to have a go at this, for example pattern match
> for B's and count the matches using low level matching or coerce to numeric
> quality scores and summarize those. I have provided an example where I
>  subset the ShortReadQ object to work with the 3' end of interest. However,
> if you do this, qa()  does not take into account that you may have changed
> the starting cycle position to something other than 1. It would be quite
> nice to have an additional argument  in qa() which sets an alternative start
> cycle, for example if you ran a quality report on the last 20 cycles of
> interest when there are 100 cycles, your qa() summary output should start at
> cycle 81 to 100.

One thing is that the subset might often and reasonably be ragged, e.g.,
'the last 20 nucleotides of a 454 run', where the reads are different
widths.

> 
> Another issue, is that ShortRead:::.plotCycleQuality only plots the means
> for each quality score versus cycle position. This does not provide any
> information about the variability in those quality scores for each cycle
> which usually increases as the cycle number increases.

It's hard to know how to do this when the number of cycles gets large,
without just writing a lot of ink to the plot. Maybe every 5th / 10th
cycle might have quantile information? (with the qa object containing
the info for all cycles).

> I have investigated using a range of quantile scores to provide additional
> distributional information, however this did not visualize particularly
> well.  Boxplots for each cycle position seem to be a better way to summarize
> quality data. Examples of these are created using the fastx toolkit (
> http://hannonlab.cshl.edu/fastx_toolkit/).
> 
> It would be nice to have function that can produce plotCycleQuality
> Boxplots however this does not appear to be straightforward from summary
> information derived from qa(). I would be interested if anyone has had a go
> at doing this, I was thinking of an approach taking relevant quality
> information out of qa() and making use of Rle objects to obtain summary
> quantile statistics and then make boxplots from first principles using a
> lattice panel function based on lattice::panel.bwplot.
> 
> 
> library(ShortRead)
> exptPath   <- system.file("extdata", package = "ShortRead")
> sp         <- SolexaPath(exptPath)
> fqpattern  <- "s_1_sequence.txt"
> fl         <- file.path(analysisPath(sp), fqpattern)
> fq         <- readFastq(sp, fqpattern)
> 
> 
> w <- width(fq[1])
> l <- 5
> ## Subset of fq for the last 5 cycles
> fqs <- new("ShortReadQ", sread = DNAStringSet(sread(fq), (w-l+1), NA),
> quality = SFastqQuality(BStringSet(quality(quality(fq)),
>         (w-l+1), NA)), id = id(fq))

Better to call the constructor ShortReadQ(sread=...) than 'new'
directly. Better still to

  narrow(fq, w-l+1, NA)

(here to narrow an object with variable width reads one might

  narrow(fq, width(fq)-l+1, NA)

> 
> ## Sanity check
> quality(quality(fqs))
> 
> ## Low level matching for pattern B
> pattern <- "B"
> for(i in 1:l) {
>   cat("[ searching for", pattern, "at position", i, "using hasLetterAt ]\n")
>   print(sum(hasLetterAt(quality(quality(fqs)), "X", i)))
>   cat("[ searching for", pattern, "at position", i, "using isMatchingAt
> ]\n")
>   print(sum(isMatchingAt("X", quality(quality(fqs)), i)))
> }
> 
> ## Extracting scores matrix of quality scores
> scores <- as(SFastqQuality(quality(quality(fq))), "matrix")[, (w-l+1):w]

already an SFastqQuality, so just as(quality(fq), "matrix").

Thanks for the ideas. Martin

> summary(scores)
> 
> ## Run a qa report on a subset of fq
> qaSummary  <- qa(fqs, lane="A")
> reportName <- report(qaSummary)
> browseURL(reportName)
> 
> ## perCycle quality plot -note start cycle incorrect
> perCycle <- qaSummary[["perCycle"]]
> ShortRead:::.plotCycleQuality(perCycle$quality)
> 
> 
> 
> Marcus
> 
> 
> On Fri, Jun 4, 2010 at 8:04 AM, Steve Lianoglou <
> mailinglist.honeypot at gmail.com> wrote:
> 
>> Hi,
>>
>> On Thu, Jun 3, 2010 at 3:39 PM, Pratap, Abhishek
>> <APratap at som.umaryland.edu> wrote:
>>> Hi All
>>>
>>> I would like to extract and count the last 5 quality values from the
>> FASTQ file. I have read the file using "readFastq" and have stored the
>> quality values as a BStringSet.
>>>
>>> Eg :
>>> A BStringSet instance of length 5119916
>>>          width seq
>>>      [1]    75
>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>>>      [2]    75
>> bbbbbbbbbbbbabbbbbb`bbbbbbab`b_...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>>>      [3]    75
>> aaaaaaa_aaaaO`aa^aaa_a_T_``^[`S...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>>>      [4]    75
>> bbbbbbbbbbbbaabbbb`bbb_Uaa___BB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>>>      [5]    75
>> ``a`aa`aaYaTaaaBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>>>
>>> What I would like to do is subseq the last 5 quality values and do a
>> count on #B. We suspect despite good avg quality we still have HIGH bad
>> bases at the end of reads.
>>>
>>> Any other ideas welcome.
>>
>> How about just plotting the average quality score at each base
>> position by doing something like:
>>
>> 1. Converting your phred score BStringSet into a matrix of its numeric
>> values
>> 2. Plotting the colMeans(...) of that matrix.
>>
>> Maybe?
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793