[Bioc-sig-seq] Quality Value Analysis from a BStringSet

Thu Jun 3 22:04:49 CEST 2010

Hi,

On Thu, Jun 3, 2010 at 3:39 PM, Pratap, Abhishek
<APratap at som.umaryland.edu> wrote:
> Hi All
>
> I would like to extract and count the last 5 quality values from the FASTQ file. I have read the file using "readFastq" and have stored the quality values as a BStringSet.
>
> E.g.:
> A BStringSet instance of length 5119916
>          width seq
>      [1]    75 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>      [2]    75 bbbbbbbbbbbbabbbbbb`bbbbbbab`b_...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>      [3]    75 aaaaaaa_aaaaO`aa^aaa_a_T_``^[`S...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>      [4]    75 bbbbbbbbbbbbaabbbb`bbb_Uaa___BB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>      [5]    75 ``a`aa`aaYaTaaaBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>
> What I would like to do is take the last 5 quality values of each read (e.g. with subseq) and count the number of B's among them. We suspect that, despite a good average quality, we still have a high number of bad bases at the ends of the reads.
>
> Any other ideas welcome.

How about just plotting the average quality score at each base
position by doing something like:

1. Converting your Phred-score BStringSet into a matrix of its numeric values (one row per read, one column per base position)
2. Plotting the colMeans(...) of that matrix.

Maybe?
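
To make the two steps above concrete, here is a rough, language-neutral sketch in plain Python (not Bioconductor code). It also covers the count of B's in the last 5 positions that you asked about. Everything in it is an assumption for illustration: the helper names quality_matrix, column_means and count_in_tail are made up, the toy reads are invented rather than taken from your file, and the offset of 64 is only a guess based on the B's and lowercase letters in your example (use 33 if your file is Sanger-encoded).

```python
def quality_matrix(quals, offset=64):
    """Step 1: turn equal-width quality strings into rows of integer scores.

    offset=64 is an assumption (Illumina-style encoding); use 33 for
    Sanger-style FASTQ files.
    """
    widths = {len(q) for q in quals}
    if len(widths) != 1:
        raise ValueError("all reads must have the same width")
    return [[ord(ch) - offset for ch in q] for q in quals]


def column_means(matrix):
    """Step 2: mean score at each base position (the colMeans idea)."""
    n_reads = len(matrix)
    return [sum(column) / n_reads for column in zip(*matrix)]


def count_in_tail(quals, char="B", tail=5):
    """Count how often `char` occurs in the last `tail` positions of each read."""
    return [q[-tail:].count(char) for q in quals]


# Hypothetical toy reads of width 8, invented for this example.
quals = ["bbbbbBBB", "bbbaBBBB", "bbbbbbbb"]

means = column_means(quality_matrix(quals))
tail_counts = count_in_tail(quals, char="B", tail=5)

for position, mean in enumerate(means, start=1):
    print(f"position {position}: mean quality {mean:.2f}")
print("B's in the last 5 positions of each read:", tail_counts)
```

Plotting the per-position means against position should show any drop-off toward the 3' end directly, while the per-read counts tell you how many reads end in a run of B's.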

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact