[Bioc-sig-seq] Confused on fastq to numeric conversion (SolexaPipeline 1.3+)

Martin Morgan mtmorgan at fhcrc.org
Fri Nov 27 00:13:53 CET 2009


Hi Leonardo --

Leonardo Collado Torres wrote:
> Dear list,
> 
> I'm puzzled over something very simple. I've got some fresh Illumina
> data and I want to make a plot of the median (10% quantile, etc)
> qualities (Y axis) per cycle (X axis).
> 
> On the code below, I'm simply reading a test file in fastq format,
> printing the quality, checking that nothing weird went when reading the
> file (could be :P), and changing the ASCII qualities to numeric values.
> The symbol "D" gets changed to 35; the lowest value on the 1st sequence.
> 
>>library(ShortRead)
>>reads <- readFastq(dirPath = ".", pattern = "test.txt")
> 
>> reads[1]
> class: ShortReadQ
> length: 1 reads; width: 50 cycles
>> sread(reads[1])
>  A DNAStringSet instance of length 1
>    width seq
> [1]    50 ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
>> quality(reads[1])
> class: FastqQuality
> quality:
>  A BStringSet instance of length 1
>    width seq
> [1]    50 aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
>> system("head -n 4 test.txt")
> @HWUSI-EAS636_0001:8:1:1:1353#0/1
> ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
> +HWUSI-EAS636_0001:8:1:1:1353#0/1
> aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
>>
>> as(quality(reads[1]), "numeric") # Will use "matrix" when using all
> reads instead of the 1st one
> [1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63 58
> 35 57
> [26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
> 59 59 64
>>
>> summary(as(quality(reads[1]), "numeric"))
>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   35.0    62.0    63.0    61.7    64.0    64.0
>>
> 
> Now, what confuses me is that according to
> http://en.wikipedia.org/wiki/FASTQ_format SolexaPipeline 1.3+ "can
> encode a Phred quality score from 0 to 62 using ASCII 64 to 126
> (although in raw read data Phred scores from 0 to 40 only are
> expected)". Then, on the ASCII table at wiki
> (http://en.wikipedia.org/wiki/ASCII) glyph (I called it symbol earlier)
> "D" corresponds to 68 decimal, which should be a Phred quality score of
> 6 but its 35 in R. Glyph "a" is 97 decimal, which should be 35 but is 64
> in R.

fastq files don't contain enough information to know whether the data
are encoded with ASCII 64 == 1 (Solexa-style) or ASCII 32 == 1
('classic'-style). There is an argument qualityType that you can use to
specify which, so

  readFastq("test.txt", qualityType="SFastqQuality")

for Solexa encoding. Note that to read a single file it is no longer
necessary to specify both directory and pattern. And finally that
as(<...>, "matrix") etc convert from the ASCII to integer encoding;
there is still a transform from integer to p value, if that is desired,
and that there are several possibilities for that transformation as well.

Martin

> 
> So, do I substract 29 to all the values I get in R or am I missing
> something?
> 
> Thank you and enjoy Thanksgiving if you celebrate it,
> Leonardo
> 
> 
> 
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
> 
> locale:
> [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C                 [3]
> LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915   [5]
> LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915  [7]
> LC_PAPER=en_US.iso885915       LC_NAME=C                    [9]
> LC_ADDRESS=C                   LC_TELEPHONE=C               [11]
> LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C         
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base   
> other attached packages:
> [1] ShortRead_1.4.0   lattice_0.17-26   BSgenome_1.14.0   Biostrings_2.14.0
> [5] IRanges_1.4.0  
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.0 grid_2.10.0   hwriter_1.1   tools_2.10.0
>>
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list