[Bioc-sig-seq] Calculating/plotting Avg Base Quality

Martin Morgan mtmorgan at fhcrc.org
Mon Aug 3 23:13:14 CEST 2009


Pratap, Abhishek wrote:
> Thanks Martin.
> 
> The reason I don't want to use (qa/report) functions here is because I
> want to display avg base quality scores from two different runs on the
> same graph for the sake of easy visual comparison. In case this is
> possible with (qa/report) please correct me.
> 
> Also you mentioned about quality score conversion, could you please
> elaborate on what the following function is doing mathematically. I
> understand this is a bit off track question.
> 
> FastqQuality(quality(quality(rfq)))

actually, it isn't doing anything mathematically. First I should have
said that rfq came from running example(readFastq).

quality(rfq) extracts the quality scores from rfq; these are of class
'SFastqQuality', with 'S' standing for 'Solexa'.

quality(quality(rfq)) extracts the underlying representation of the
quality scores, now a 'BStringSet'. The BStringSet is exactly the same
as the BStringSet in the SFastqQuality object.

FastqQuality(quality(quality(rfq))) takes that BStringSet and wraps it
up, again unchanged, in a 'FastqQuality' object, which is used to
represent 'phred' quality scores. So you end up with the same underlying
data relabeled (now correctly) as FastqQuality.

The only 'math' that occurs is when you follow Johannes' suggestion, and
convert this to a matrix

  as(FastqQuality(quality(quality(rfq))), "matrix")

where the letters are interpreted in a way consistent with phred encoding.

Incidentally, a different way to arrive at a numerical representation of
the underlying data is with alphabetByCycle(quality(rfq)), which returns
a matrix summarizing how many times each letter occurs per cycle; you
can provide your own interpretation of the letter.

And for two quick diversions, it's interesting how sparse the matrix
returned by alphabetByCycle is -- the encoding schemes are not really
capturing the dynamic range very effectively. Finally, all of the above
assume that reads are the same width. If not, you can make them square
(e.g., with ?"narrow,ShortReadQ-method"), perhaps with careful choice of
arguments to look at the first or last positions (particularly useful
for 454 data).

Martin

Martin

> 
> Thanks,
> -Abhi
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org] 
> Sent: Monday, August 03, 2009 4:41 PM
> To: Pratap, Abhishek
> Cc: Johannes Waage; Bioc-sig-sequencing at r-project.org
> Subject: Re: [Bioc-sig-seq] Calculating/plotting Avg Base Quality
> 
> Pratap, Abhishek wrote:
>> Thanks for a quick revert. Kind of an expected question, when you talk
>> about quality format,  by default is it the Phred one ?  I will be
>> importing the FASTQ from Illumina Eland output which is 0-40 scoring
>> window.  Does that gets automatically converted ?
> 
> fastq files don't contain information on how qualities were encoded;
> readFastq isn't smart about this and assumes they're Solexa. If that's
> not correct, you can change to phred with
> 
>   FastqQuality(quality(quality(rfq)))
> 
> and vice-versa. Also recent Solexa encodings have changed their
> calculation (not their encoding) to use use something closer to phred
> rather than log-odds. This has consequence for low quality scores.
> 
> You might also look at qa() and report() in the ShortRead package, to
> generate a quality report that contains this information. The idea is to
> do this in two stages
> 
>   qastats <- qa(dirPath, pattern) # slow!
>   report(qastats) # fast
> 
> see ?qa and ?report.
> 
> Martin
> 
> 
>>  
>>
>> Thanks,
>>
>> -Abhi
>>
>>  
>>
>> From: johannes.waage at gmail.com [mailto:johannes.waage at gmail.com] On
>> Behalf Of Johannes Waage
>> Sent: Monday, August 03, 2009 4:04 PM
>> To: Pratap, Abhishek; Bioc-sig-sequencing at r-project.org
>> Subject: Re: [Bioc-sig-seq] Calculating/plotting Avg Base Quality
>>
>>  
>>
>> Try this:
>>
>> library(ShortRead)
>> my_reads<- readFastq("~/", pattern="my_reads.fastq")
>> qualities <- FastqQuality(quality(my_reads))
>> qualities <- as(qualities, "matrix")
>> boxplot(as.data.frame((qualities )), outline = F, xlab="Cycle",
>> ylab="Quality")
>>
>> Depending on quality format, you might have to correct your numeric
>> values in the matrix, or even convert to base call probabilities.
>>
>> Regards,
>> Johannes Waage,
>> Uni of copenhagen
>>
>> On Mon, Aug 3, 2009 at 9:41 PM, Pratap, Abhishek
>> <APratap at som.umaryland.edu> wrote:
>>
>> Hi All
>>
>>
>>
>> Just wondering if there is a R function in any package to do the
>> following. Want to make sure before I write something.
>>
>>
>>
>> We would like to calculate avg base quality score for each base called
>> per cycle / lane.  What will be nice is to also plot these avg quality
>> scores/lane  for couple of different runs.
>>
>>
>>
>> Thanks,
>>
>> -Abhi
>>
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>



More information about the Bioc-sig-sequencing mailing list