[Bioc-sig-seq] filtering using solexa quality scores

Thu Apr 16 00:44:04 CEST 2009

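A quick look at the range of the numerical quality values should usually 
suffice to guess whether a file is on the Solexa or the Sanger scale. 
Sanger (Phred) scores typically run from 0 to 40; Solexa scores start at 
-5 and in practice also top out around 40 (the encoding allows up to 62). 
In any case, if you have negative values you're dealing with a Solexa 
fastq file.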
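
As a rough illustration of the range check, here is a minimal Python sketch. It assumes the quality values have already been decoded to integers; the function names guess_scale and solexa_to_phred are made up for this example and do not come from any package. Note that the check only works in one direction: negative values imply Solexa, but a file with no negative values could still be on either scale, so the sketch reports "undetermined" instead of guessing.

```python
import math


def guess_scale(scores):
    """Guess the quality scale from already-decoded numeric scores.

    Heuristic: any negative value implies the Solexa scale. Without
    negative values the range alone is not conclusive, so say so.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("no quality scores given")
    if min(scores) < 0:
        return "solexa"
    return "undetermined"


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the Phred (Sanger) scale.

    Solexa: Q = -10 * log10(p / (1 - p)); Phred: Q = -10 * log10(p),
    where p is the probability that the base call is wrong.
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)
```

The two scales differ mostly at the low end: a Solexa score of -5 corresponds to a Phred score of a little over 1, while from about 10 upwards the two are nearly identical.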

Vincent Carey wrote:
> thanks to both Martin and Cei -- clearly I have to have the scale right,
> and it is my hope to do a bit of analysis of the quality score distributions
> and of decision-making using these -- positional effects are clearly of 
> interest.
> 
> On Wed, Apr 15, 2009 at 6:25 PM, Cei Abreu-Goodger <cei at ebi.ac.uk> wrote:
> 
>     Hi Vincent,
> 
>     Are you taking into account that quality scores will tend to drop
>     off towards the end of the run? I would probably restrict any sort
>     of quality filtering to the first x bases of each read... From my
>     experience, only a very small fraction of reads out of a "good" run
>     would be removed due to general quality issues. Also, if your
>     further pipeline is "quality-aware" (e.g. MAQ/bowtie for alignments)
>     you can get away with not worrying initially about the quality of
>     the reads. On the other hand, for some kinds of analysis I was
>     dropping the quality scores and making plain fasta files. In these
>     cases it would pay off to convert very low-quality bases to Ns,
>     since I would get better coverage.
> 
>     Cheers,
> 
>     Cei
> 
>     Vincent Carey wrote:
> 
>         i have scoured our archives and found little regarding role of
>         solexa quality scores as reported in fastq outputs in short read
>         filtering.
> 
>         my understanding is that a numerical score of -4 or greater
>         indicates more probability mass on the called base than on any
>         other.  in checking 1e6 reads on each of two lanes i found the
>         frequency of the event "fewer than three bases have score less
>         than -4" to be 4e-3 in one lane and 2e-3 in another.  in other
>         words, filtering by requiring no more than two < -4 scores would
>         take you from a million reads to about 2000-4000, assuming i have
>         not taken a biased sample (i may have, just took the first 1e6
>         in fastq).
> 
>         is there any reason to regard a call with score < -4 to be much
>         different from an 'N'?
> 
>         _______________________________________________
>         Bioc-sig-sequencing mailing list
>         Bioc-sig-sequencing at r-project.org
>         https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
>     -- 
>     The Wellcome Trust Sanger Institute is operated by Genome Research
>     Limited, a charity registered in England with number 1021457 and a
>     company registered in England with number 2742969, whose registered
>     office is 215 Euston Road, London, NW1 2BE.
> 
> -- 
> Vincent Carey, PhD
> Biostatistics, Channing Lab
> 617 525 2265

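The filtering ideas in the quoted messages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: scores are decoded Solexa-scale integers, and the error probability is assumed to be spread evenly over the three uncalled bases (which is what makes -4 the lowest integer score where the called base is still the most likely one). All function names and the first_n parameter are hypothetical, chosen for this example. It covers the "no more than two scores below -4" filter, optionally restricted to the first x bases of each read, and the conversion of low-quality calls to N.

```python
def called_base_probability(q_solexa):
    """Probability that the called base is correct, given a Solexa score.

    Uses the Solexa definition Q = -10 * log10(p_err / (1 - p_err)).
    """
    odds_error = 10 ** (-q_solexa / 10)
    return 1 / (1 + odds_error)


def count_low(scores, threshold=-4):
    """Number of bases whose score is strictly below the threshold."""
    return sum(1 for q in scores if q < threshold)


def keep_read(scores, threshold=-4, max_low=2, first_n=None):
    """Keep a read if at most max_low bases score below the threshold.

    If first_n is given, only the first first_n bases are inspected.
    """
    window = scores if first_n is None else scores[:first_n]
    return count_low(window, threshold) <= max_low


def mask_low_quality(sequence, scores, threshold=-4):
    """Replace every base scoring below the threshold with 'N'."""
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores differ in length")
    return "".join(
        "N" if q < threshold else base
        for base, q in zip(sequence, scores)
    )
```

For example, mask_low_quality("ACGT", [-5, 10, -4, -5]) gives "NCGN", and keep_read([10, 10, -5, -5, -5], first_n=2) keeps a read whose low scores all fall outside the inspected window. Under the even-spread assumption, called_base_probability(-4) is about 0.28 (above 1/4) and called_base_probability(-5) is about 0.24 (below 1/4).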