[Bioc-sig-seq] Confused on fastq to numeric conversion (SolexaPipeline 1.3+)
Leonardo Collado Torres
lcollado at lcg.unam.mx
Fri Nov 27 04:26:20 CET 2009
Hello Martin and Harris,
Thank you for your great, very complete and extremely fast replies!! ^_^
Greetings,
Leonardo
PD Just CCing so the replies get archived on the mailing list for later
newbies like me ;)
Martin Morgan wrote:
> Harris A. Jaffee wrote:
>
>> Perhaps this is a manifestation of those silent choices:
>>
>>
>>> as(quality(reads[1])[[1]], "numeric")
>>>
>> [1] 97 97 97 96 92 97 97 97 95 96 95 96 97 97 97 97 97 97 94 88 97 96
>> 91 68 90
>> [26] 96 96 97 96 96 97 96 96 96 97 96 94 96 97 97 97 83 96 92 93 96 96
>> 92 92 97
>>
>
> It's not really obvious to me why quality(<...>)[[1]] extracts a
> BStringSet. But once it does, the BStringSet is convert to numeric as
> appropriate for a BString, which turns out to be the ASCII encoding.
>
>
>>> as(quality(reads[1]), "numeric")
>>>
>> [1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63
>> 58 35 57
>> [26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
>> 59 59 64
>>
>
> quality(<...>) is a SFastqQuality or FastqQuality object, and is
> converted to numeric as appropriate for short read qualities.
>
>
>> I am less than a novice in ShortRead, so I don't feel adequate enough to
>> post to
>> the world. Barely adequate enough to make a private guess.
>>
>
> You're too modest, Harris.
>
> Martin
>
>
>> On Nov 26, 2009, at 6:13 PM, Martin Morgan wrote:
>>
>>
>>> Hi Leonardo --
>>>
>>> Leonardo Collado Torres wrote:
>>>
>>>> Dear list,
>>>>
>>>> I'm puzzled over something very simple. I've got some fresh Illumina
>>>> data and I want to make a plot of the median (10% quantile, etc)
>>>> qualities (Y axis) per cycle (X axis).
>>>>
>>>> On the code below, I'm simply reading a test file in fastq format,
>>>> printing the quality, checking that nothing weird went when reading the
>>>> file (could be :P), and changing the ASCII qualities to numeric values.
>>>> The symbol "D" gets changed to 35; the lowest value on the 1st sequence.
>>>>
>>>>
>>>>> library(ShortRead)
>>>>> reads <- readFastq(dirPath = ".", pattern = "test.txt")
>>>>>
>>>>> reads[1]
>>>>>
>>>> class: ShortReadQ
>>>> length: 1 reads; width: 50 cycles
>>>>
>>>>> sread(reads[1])
>>>>>
>>>> A DNAStringSet instance of length 1
>>>> width seq
>>>> [1] 50 ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
>>>>
>>>>> quality(reads[1])
>>>>>
>>>> class: FastqQuality
>>>> quality:
>>>> A BStringSet instance of length 1
>>>> width seq
>>>> [1] 50 aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
>>>>
>>>>> system("head -n 4 test.txt")
>>>>>
>>>> @HWUSI-EAS636_0001:8:1:1:1353#0/1
>>>> ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
>>>> +HWUSI-EAS636_0001:8:1:1:1353#0/1
>>>> aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
>>>>
>>>>> as(quality(reads[1]), "numeric") # Will use "matrix" when using all
>>>>>
>>>> reads instead of the 1st one
>>>> [1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63 58
>>>> 35 57
>>>> [26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
>>>> 59 59 64
>>>>
>>>>> summary(as(quality(reads[1]), "numeric"))
>>>>>
>>>> Min. 1st Qu. Median Mean 3rd Qu. Max.
>>>> 35.0 62.0 63.0 61.7 64.0 64.0
>>>>
>>>> Now, what confuses me is that according to
>>>> http://en.wikipedia.org/wiki/FASTQ_format SolexaPipeline 1.3+ "can
>>>> encode a Phred quality score from 0 to 62 using ASCII 64 to 126
>>>> (although in raw read data Phred scores from 0 to 40 only are
>>>> expected)". Then, on the ASCII table at wiki
>>>> (http://en.wikipedia.org/wiki/ASCII) glyph (I called it symbol earlier)
>>>> "D" corresponds to 68 decimal, which should be a Phred quality score of
>>>> 6 but its 35 in R. Glyph "a" is 97 decimal, which should be 35 but is 64
>>>> in R.
>>>>
>>> fastq files don't contain enough information to know whether the data
>>> are encoded with ASCII 64 == 1 (Solexa-style) or ASCII 32 == 1
>>> ('classic'-style). There is an argument qualityType that you can use to
>>> specify which, so
>>>
>>> readFastq("test.txt", qualityType="SFastqQuality")
>>>
>>> for Solexa encoding. Note that to read a single file it is no longer
>>> necessary to specify both directory and pattern. And finally that
>>> as(<...>, "matrix") etc convert from the ASCII to integer encoding;
>>> there is still a transform from integer to p value, if that is desired,
>>> and that there are several possibilities for that transformation as well.
>>>
>>> Martin
>>>
>>>
>>>> So, do I substract 29 to all the values I get in R or am I missing
>>>> something?
>>>>
>>>> Thank you and enjoy Thanksgiving if you celebrate it,
>>>> Leonardo
>>>>
>>>>
>>>>
>>>>
>>>>> sessionInfo()
>>>>>
>>>> R version 2.10.0 (2009-10-26)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3]
>>>> LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5]
>>>> LC_MONETARY=C LC_MESSAGES=en_US.iso885915 [7]
>>>> LC_PAPER=en_US.iso885915 LC_NAME=C [9]
>>>> LC_ADDRESS=C LC_TELEPHONE=C [11]
>>>> LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>> other attached packages:
>>>> [1] ShortRead_1.4.0 lattice_0.17-26 BSgenome_1.14.0
>>>> Biostrings_2.14.0
>>>> [5] IRanges_1.4.0
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.6.0 grid_2.10.0 hwriter_1.1 tools_2.10.0
>>>>
>>> --
>>> Martin Morgan
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>
>
>
--
Leonardo Collado Torres, Bachelor in Genomic Sciences
Teacher at LCG and member of Dr. Enrique Morett's lab
UNAM Campus Cuernavaca, Mexico
Homepage: http://www.lcg.unam.mx/~lcollado/
Phone: [52] (777) 313-28-05
More information about the Bioc-sig-sequencing
mailing list