[Bioc-sig-seq] filtering using solexa quality scores
Cei Abreu-Goodger
cei at ebi.ac.uk
Thu Apr 16 00:44:04 CEST 2009
Checking the range of the numerical quality values should usually be enough
to guess whether the scale is Solexa/Illumina or Sanger. Sanger is 1 to 40,
Solexa is -5 to 60(?). In any case, if you have negative values you're
dealing with a Solexa fastq file.
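
The range check described above can be sketched in a few lines of Python. This is only an illustration: the function names `decode` and `guess_scale` are made up for this example, and the ASCII offsets (64 for Solexa-era files, 33 for Sanger) are assumptions about how a particular file was written. The only rule used is the one stated above: negative scores can only come from the Solexa scale. A non-negative range is consistent with Sanger but does not prove it, so the sketch reports it as undetermined.

```python
def decode(quality_string, offset):
    """Turn an ASCII-encoded quality string into a list of integer scores.

    `offset` is an assumption about the file: 64 is the usual offset for
    Solexa-era fastq files, 33 for Sanger fastq files.
    """
    return [ord(ch) - offset for ch in quality_string]


def guess_scale(scores):
    """Guess the quality scale from already-decoded numeric scores.

    Returns "solexa" if any score is negative (Phred scores cannot be
    negative), and "undetermined" otherwise.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("no quality scores given")
    if min(scores) < 0:
        return "solexa"
    return "undetermined"


if __name__ == "__main__":
    # Hypothetical quality string, decoded with the assumed Solexa offset.
    scores = decode(";<=@h", 64)
    print(scores, guess_scale(scores))
```

In practice one would pool the scores from many reads before deciding, since a short sample of good reads may contain no negative values at all.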
Vincent Carey wrote:
> thanks to both Martin and Cei -- clearly I have to have the scale right,
> and it is my hope to do a bit of analysis of the quality score distributions
> and of decision-making using these -- positional effects are clearly of
> interest.
>
> On Wed, Apr 15, 2009 at 6:25 PM, Cei Abreu-Goodger <cei at ebi.ac.uk> wrote:
>
> Hi Vincent,
>
> Are you taking into account that quality scores will tend to drop
> off towards the end of the run? I would probably restrict any sort
> of quality filtering to the first x bases of each read... From my
> experience, only a very small fraction of reads out of a "good" run
> would be removed due to general quality issues. Also, if your
> further pipeline is "quality-aware" (e.g. MAQ/bowtie for alignments)
> you can get away with not worrying initially about the quality of
> the reads. On the other hand, for some kinds of analysis I was
> dropping the quality scores and making plain fasta files. In these
> cases it would pay off to convert very low-quality bases to Ns,
> since I would get better coverage.
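
The conversion of very low-quality bases to Ns mentioned above could look roughly like the following Python sketch. The function names and the `cutoff` parameter are hypothetical, the scores are assumed to be already decoded to integers, and no particular cutoff is recommended here.

```python
def mask_low_quality(seq, quals, cutoff):
    """Replace every base whose score is below `cutoff` with 'N'.

    `seq` is the read as a string and `quals` the per-base integer
    scores, in the same order and of the same length.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return "".join(
        base if q >= cutoff else "N"
        for base, q in zip(seq, quals)
    )


def to_fasta_record(name, seq, quals, cutoff):
    """Format one read as a plain fasta record with low-quality bases masked."""
    return ">{}\n{}\n".format(name, mask_low_quality(seq, quals, cutoff))


if __name__ == "__main__":
    # Hypothetical read with two very low-scoring positions.
    print(to_fasta_record("read1", "ACGT", [30, -5, 10, -4], 0), end="")
```

Masking keeps the read length intact, so a downstream tool that accepts Ns can still use the good positions of a read that would otherwise be discarded.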
>
> Cheers,
>
> Cei
>
> Vincent Carey wrote:
>
> i have scoured our archives and found little regarding role of solexa
> quality scores as reported in fastq outputs in short read filtering.
>
> my understanding is that a numerical score of -4 or greater indicates
> more probability mass on the called base than on any other. in checking
> 1e6 reads on each of two lanes i found the frequency of the event "fewer
> than three bases have score less than -4" to be 4e-3 in one lane and
> 2e-3 in another. in other words, filtering by requiring no more than
> two < -4 scores would take you from a million reads to about 2000-4000,
> assuming i have not taken a biased sample (i may have, just took the
> first 1e6 in fastq).
>
> is there any reason to regard a call with score < -4 to be much
> different from an 'N'?
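
For reference, the filter described in this message can be sketched in Python as below. The function names are hypothetical and the scores are assumed to be decoded Solexa-scaled integers. The conversion uses the standard definition of the Solexa score, q = -10 * log10(e / (1 - e)), where e is the error probability of the called base. Under that definition a score of -4 leaves about 0.28 on the called base and a score of -5 about 0.24, which is one way to read the -4 threshold against the 0.25 of a uniform guess over four bases. The optional `first_x` argument restricts the count to the first x bases, following the suggestion in the reply quoted above.

```python
def solexa_error_prob(q):
    """Error probability implied by a Solexa-scaled score q.

    Inverts q = -10 * log10(e / (1 - e)) to e = 1 / (1 + 10 ** (q / 10)).
    """
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


def count_low(quals, cutoff=-4, first_x=None):
    """Count scores strictly below `cutoff`, optionally in the first x bases only."""
    window = quals if first_x is None else quals[:first_x]
    return sum(1 for q in window if q < cutoff)


def keep_read(quals, cutoff=-4, max_low=2, first_x=None):
    """Keep a read if it has no more than `max_low` scores below `cutoff`."""
    return count_low(quals, cutoff, first_x) <= max_low


if __name__ == "__main__":
    for q in (-5, -4, 0, 10):
        print(q, round(1.0 - solexa_error_prob(q), 3))
    # Hypothetical read: three low scores, all near the 3' end.
    quals = [20, 18, 15, 10, -5, -5, -5]
    print(keep_read(quals), keep_read(quals, first_x=4))
```

The defaults (`cutoff=-4`, `max_low=2`) mirror the numbers in the message and are not a recommendation.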
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
>
>
>
> --
> Vincent Carey, PhD
> Biostatistics, Channing Lab
> 617 525 2265