[Bioc-sig-seq] Confused on fastq to numeric conversion (SolexaPipeline 1.3+)

Leonardo Collado Torres lcollado at lcg.unam.mx
Thu Nov 26 23:38:57 CET 2009


Dear list,

I'm puzzled over something very simple. I've got some fresh Illumina 
data and I want to make a plot of the median (10% quantile, etc) 
qualities (Y axis) per cycle (X axis).

On the code below, I'm simply reading a test file in fastq format, 
printing the quality, checking that nothing weird went when reading the 
file (could be :P), and changing the ASCII qualities to numeric values. 
The symbol "D" gets changed to 35; the lowest value on the 1st sequence.

 >library(ShortRead)
 >reads <- readFastq(dirPath = ".", pattern = "test.txt")

 > reads[1]
class: ShortReadQ
length: 1 reads; width: 50 cycles
 > sread(reads[1])
  A DNAStringSet instance of length 1
    width seq
[1]    50 ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
 > quality(reads[1])
class: FastqQuality
quality:
  A BStringSet instance of length 1
    width seq
[1]    50 aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
 > system("head -n 4 test.txt")
@HWUSI-EAS636_0001:8:1:1:1353#0/1
ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
+HWUSI-EAS636_0001:8:1:1:1353#0/1
aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
 >
 > as(quality(reads[1]), "numeric") # Will use "matrix" when using all 
reads instead of the 1st one
 [1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63 
58 35 57
[26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63 
59 59 64
 >
 > summary(as(quality(reads[1]), "numeric"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   35.0    62.0    63.0    61.7    64.0    64.0
 >

Now, what confuses me is that according to 
http://en.wikipedia.org/wiki/FASTQ_format SolexaPipeline 1.3+ "can 
encode a Phred quality score from 0 to 62 using ASCII 64 to 126 
(although in raw read data Phred scores from 0 to 40 only are 
expected)". Then, on the ASCII table at wiki 
(http://en.wikipedia.org/wiki/ASCII) glyph (I called it symbol earlier) 
"D" corresponds to 68 decimal, which should be a Phred quality score of 
6 but its 35 in R. Glyph "a" is 97 decimal, which should be 35 but is 64 
in R.

So, do I substract 29 to all the values I get in R or am I missing 
something?

Thank you and enjoy Thanksgiving if you celebrate it,
Leonardo



 > sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C                 
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915   
 [5] LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915  
 [7] LC_PAPER=en_US.iso885915       LC_NAME=C                    
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C               
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C          

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] ShortRead_1.4.0   lattice_0.17-26   BSgenome_1.14.0   Biostrings_2.14.0
[5] IRanges_1.4.0   

loaded via a namespace (and not attached):
[1] Biobase_2.6.0 grid_2.10.0   hwriter_1.1   tools_2.10.0
 >

-- 
Leonardo Collado Torres, Bachelor in Genomic Sciences
Teacher at LCG and member of Dr. Enrique Morett's lab
UNAM Campus Cuernavaca, Mexico

Homepage: http://www.lcg.unam.mx/~lcollado/
Phone: [52] (777) 313-28-05



More information about the Bioc-sig-sequencing mailing list