[Bioc-sig-seq] Standard 454 quality checks?

Martin Morgan mtmorgan at fhcrc.org
Mon May 4 15:22:51 CEST 2009


Dan Bolser <dan.bolser at gmail.com> writes:

> 2009/5/4 Michael Lawrence <mflawren at fhcrc.org>:
>>
>>
>> On Mon, May 4, 2009 at 2:19 AM, Dan Bolser <dan.bolser at gmail.com> wrote:
>>>
>>> 2009/4/23 Martin Morgan <mtmorgan at fhcrc.org>:
>>> > Dan Bolser <dan.bolser at gmail.com> writes:
>>> >
>>> >> Sorry for the noob question, but is there a set of standard quality
>>> >> checks in R that I can run over some 454 data? I have the fasta and
>>> >> the fasta format quality files as well as an sff. I scanned the manual
>>> >> for the ShortReads package, but it seems focused on Illumina, I
>>> >> couldn't pick out the general bits from the specifics.
>>> >
>>> > Hi Dan --
>>> >
>>> > You might be on somewhat uncharted territory here; most of our
>>> > experience (though we have some 454 data now) is with Solexa.
>>> >
>>> > I don't think the standard QA pipeline, along the lines of
>>> > report(qa(<...>)), will work at the moment, but I'll try to add that
>>> > today.
>>> >
>>> > You should be able to read the fasta and quality scores with
>>> > read454(). This returns a 'ShortReadQ' object, srq, that bundles the
>>> > reads, their quality scores, and their ids.
>>>
>>> Hi Martin, thanks for the info on this. I'm have only just got round
>>> to looking at this but I'm a little bit confused about how to read in
>>> the reads / qualities.
>>>
>>> I had expected to say something like:
>>>
>>> x <- read454(fasta = "my.fas", qual.fasta = "my.qual")

read454 was expecting .fna and .qual as extensions, and then

  rp <- RochePath(experimentPath=".", readPath=".", qualPath=".")

list.files(readPath, "*.fasta") would provide a list of matching
files, and if the first were the one that you were interested in
reading in then

  srq <- read454(rp, sample=1)

would have done that.

More below

>>> or
>>>
>>> x <- read454(srf = "my.srf")
>>>
>>>
>>> or
>>>
>>> x <- read454(fastq = "my.fastq")
>>>
>>>
>>>
>>> However, this is clearly not correct. "?read454" brings up the
>>> "RochePath-class" page that suggests I try something like:
>>>
>>> x <- RochePath(readPath="./")
>>> y<- read454(x)
>>>
>>>
>>> But that fails too... "no input files found / pattern: \.fna$". I
>>> tried to set pattern somewhere (to match the ".fas" / and ".qual"
>>> patterns of my files), but I couldn't seem to set them anywhere.
>>>
>>> Can you help with an example of what I am doing wrong?
>>
>>
>> Thanks for this great feedback. One way to achieve what you want would be to
>> call readFasta and readQual separately for each file, and then combine the
>> results into a ShortReadQ. In the devel version of ShortRead (it might take
>> a day to propagate), I just renamed read454 to readFastaQual and added
>> separate pattern arguments for the fasta and qual file, so you could do:
>> srq <- readFastaQual(".", "my.fas", "my.qual")
>>
>> How does that sound? I think that this interface could be made more
>
> The above style of invocation sounds great, but I don't see why its not simply,
>
> srq <- readFastaQual("./my.fas", "./my.qual")
>
>
>> convenient (such as having separate pattern arguments for the extensions and
>> then a "base" pattern), but I'm just trying to stay consistent with what is
>> ShortRead already.
>
>
> Its good to be consistent, but I don't see any other R function that
> works like this... i.e. when I read.table I don't have to think about
> 'patterns' of any kind. I realise that this can be convenient when you
> have the experimental output directory, but I think its more often
> (and more flexible) to just pass the file names directly like any
> other R function.

The original model was list.files, with one convenience being the
ability to read many files (all those matching 'pattern') into a
single object. Most read* functions that read a single type of file
take a single file name, too, e.g., readAligned(fl,
type="SolexaExport"). 

> Anyway, as I said originally, I'm very new to ShortRead and BioC in
> general, so it could be that I'm just not used to this style of
> thinking.
>
> Thanks very much for your help developing these tools.
>
> Since you mentioned that currently things are not that finalised with
> regard to 454 checks, I'll include the code that I have been using
> below, in the hope that it could be of some small use to you.
>
>
> All the best,
> Dan.
>
>
>
> source("~/BiO/Lang/R/readFASTA.R")
> source("~/BiO/Lang/R/readQualityFASTA.R")
>
>
>
> if(!interactive()){
>   fasta <-
>     readFASTA("Data/FUBJ29X06.fas")
> }

  fasta <- readFasta("Data/FUBJ29X06.fas")

or

  fasta <- sread(srq)

gets you a DNAStringSet, which will have much nicer characteristics.

>
> head(fasta,2)
> length(fasta)
>
>
>
> ## Recreate the histogram provided by the sequencing center
>
> ## I just like this code ;-)
> ##plot(table(sapply(fasta, function(l){nchar(l[[2]])})))

  plot(table(width(fasta))) # or srq

>
> ## Do the above, but in a way that saves compute time later...
>
> read.lengths <-
>   sapply(fasta, function(l){nchar(l[[2]])})
>
> length.table <-
>   table(read.lengths)
> class(length.table)
>
> plot(as.integer(length.table),
>      main="Frequency plot of the read length",
>      xlab="Read length",
>      ylab="Frequency")
>
>
>
> ## How many reads does the read with the most number of reads have
> (nx <- max(length.table ))

  tbls <- tables(fasta) # or tables(srq)
  tbls$top[[1]]

> ## OK, but which length is it?
> (lx <- as.integer(names(length.table[length.table == nx])))

  nchar(names(tbls$top[1]))

> ## To be absolutely clear:
> nx # number of reads
> lx # of this length.
>
>
>
>
>
>
>
> ## Now do something with the quality scores
> if(!interactive()){
>   fasta.qual <-
>     readQualityFASTA("Data/FUBJ29X06.qual")
> }
>
> head(fasta.qual,2)
> length(fasta.qual)
>
>
>
> ## Plot one example 'quality profile'...
> plot(fasta.qual[[6]][[2]], type="l", lwd=2,
>      main="Quality score for a read",
>      xlab="Base position",
>      ylab="Quality score"
>      )
>
>
>
> ## Plot the mean quality score across all the nx reads of length lx
> ## (see above).

  l254 <- srq[width(srq) == 254]
  plot(colMeans(as(quality(l254), "matrix")))

> ## Get the indexes of the reads of this length
> I <-
>   which(read.lengths == lx)
> length(I)
>
> head(I)
> tail(I)
>
> ## Just checking ;-)
> stopifnot(length(I) == nx)
>
>
>
> ## Read the qualities of these reads into a matrix
>
> m <- matrix( 0, nrow=nx, ncol=lx )

  m <- as(quality(l254), "matrix")
>
> m[1:6,1:6]
>
> for(i in 1:nx){
>   ## Double check the index is giving us the reads we expect
>   stopifnot(length( fasta.qual[[ I[i] ]][[2]] ) == lx)
>
>   m[i,] = fasta.qual[[ I[i] ]][[2]]
> }
>
> m[1:6,1:6]
>
>
>
> ## Now we can easily calculate the mean quality over the set of reads
>
> plot(apply(m, 2, mean), type='l', lwd=2,
>      ylim=c(0,50),
>      main=paste("Mean quality score for the", nx, "reads of length", lx, "bp"),
>      xlab="Base position",
>      ylab="Mean quality score",
>      )
>
> lines(apply(m, 2, mean) +
>       apply(m, 2,   sd), type='l',
>       ylim=c(0,40), col=2)
>
> lines(apply(m, 2, mean) -
>       apply(m, 2,   sd), type='l',
>       ylim=c(0,40), col=2)
>
> abline(h=20, col=2, lty=2)
>
> legend(x="topright", inset=0.05, bg="white",
>        leg=expression("" %+-% 1 * "s.d."), col=2, lty=1)
>
>
>
>
>
> ## Plot some overall quality score statistics
>
> ## Reminder
> class(fasta.qual)
> class(fasta.qual[[1]])
> class(fasta.qual[[1]][[2]])
>
> base.qualities <-
>   unlist(lapply(fasta.qual, function(l){l[[2]]}))
>
> length(base.qualities)
>
> ## Sanity check
> stopifnot(sum(read.lengths) == length(base.qualities))
>
>
>
> if(!interactive()){
>   ecdf.base.qualities <-
>     ecdf(base.qualities)
> }
>
> plot(ecdf.base.qualities, verticals = TRUE,
>      main="Empirical Cumulative Distribution Function",
>      xlab="Quality score",
>      ylab="Fraction of bases")
>
> abline(v=20, col=2, lty=2)
> abline(h=ecdf.base.qualities(20), col=2, lty=2)
>
> abline(v=35, col=4, lty=4)
> abline(h=ecdf.base.qualities(35), col=4, lty=4)
>
>
>
> read.qualities <-
>   sapply(fasta.qual, function(l){mean(l[[2]])})

I think this is alphabetScore(quality(srq))

> hist(read.qualities,
>      main="Histogram of mean read quality score",
>      xlab="Mean quality score of a read",
>      ylab="Number of reads")
>
> ## Too dense
> #rug(read.qualities)
>
>
>
>
>
> ## Plot mean quality against length
>
> plot(x=read.lengths,
>      y=read.qualities)
>
> cor(read.lengths, read.qualities)
> cor.test(read.lengths, read.qualities)
> cor.test(read.lengths, read.qualities, method="pearson")
>
> (my.lm <- lm(read.qualities~read.lengths))
>
> summary(my.lm)
>
> abline(coef(my.lm), col=2)
>
>
> ## Test that I understand the slope properly
>
> (tx1 <- 100)
> (ty1 <- (2.683e+01 + tx1 * 8.974e-03))
> points(tx1, ty1, col=3, pch=4, cex=2, lwd=3)
>
> (tx2 <- 400)
> (ty2 <- (2.683e+01 + tx2 * 8.974e-03))
> points(tx2, ty2, col=3, pch=4, cex=2, lwd=3)
>
>
>
>
>
> ## Lets filter that by read length...
>
> read.length.filter <- read.lengths > 200
>
>
> ## But look at this
> (xx <- sum(read.lengths))
> (yy <- sum(read.lengths[read.length.filter]))
>
> round((xx - yy) / xx * 100, 1)
>
>
> base.qualities.filtered <-
>   unlist(lapply(fasta.qual[read.length.filter], function(l){l[[2]]}))
>
> length(base.qualities) == xx
> length(base.qualities.filtered) == yy
>
> my.breaks <-
>   c( 0,
>     2,  4,  6,  8, 10,
>     12, 14, 16, 18, 20,
>     22, 24, 26, 28, 30,
>     32, 34, 36, 38, 40)
>
> q.t <- table(base.qualities)
> q.f.t <- table(base.qualities.filtered)
>
> length(q.t)
> length(q.f.t)
>
>
>
> barplot(rbind(q.t, q.f.t), beside=TRUE,
>         main="Distribution of base qualities",
>         xlab="Quality score",
>         ylab="Number of bases")
>
> legend(x="topleft", inset=0.05,
>        fill=c(grey(0.4), grey(0.9)),
>        leg=c(
>          "full set",
>          "length filtered")
>        )
>
> --------
>
> --- /local/Scratch/dbolser/BiO/Lang/R/readFASTA.R       2008-08-26
> 12:42:39.000000000 +0100
> +++ /local/Scratch/dbolser/BiO/Lang/R/readQualityFASTA.R
> 2008-08-26 13:11:34.000000000 +0100
> @@ -1,5 +1,5 @@
>
> -readFASTA <-
> +readQualityFASTA <-
>    function (file, checkComments = TRUE, strip.desc = TRUE)
>  {
>    if (missing(strip.desc))
> @@ -37,7 +37,8 @@
>      desc <- s1[descriptions[i]]
>      if (strip.desc)
>        desc <- substr(desc, 2L, nchar(desc))
> -    seq <- paste(s1[dp[i]:end[i]], collapse = "")
> +    seq <- paste(s1[dp[i]:end[i]], collapse = " ")
> +    seq <- as.numeric(strsplit(seq, "\\s+", perl=TRUE)[[1]])
>      list(desc = desc, seq = seq)
>    })
>  }
>
> --------
>
> readFASTA <-
>   function (file, checkComments = TRUE, strip.desc = TRUE)
> {
>   if (missing(strip.desc))
>     warning("use 'strip.desc=FALSE' for compatibility with old version\n",
>             "  of readFASTA() or 'strip.desc=TRUE' to remove the \">\"\n",
>             "  marking the beginning of the description lines and get\n",
>             "  rid of this warning (see '?readFASTA' for more details)")
>   if (is.character(file)) {
>     file <- file(file, "r")
>     on.exit(close(file))
>   }
>   else {
>     if (!inherits(file, "connection"))
>       stop("'file' must be a character string or connection")
>     if (!isOpen(file)) {
>       open(file, "r")
>       on.exit(close(file))
>     }
>   }
>   s1 <- scan(file = file, what = "", sep = "\n", quote = "",
>              allowEscapes = FALSE)
>   if (checkComments) {
>     comments <- grep("^;", s1)
>     if (length(comments) > 0)
>       s1 <- s1[-comments]
>   }
>   descriptions <- which(substr(s1, 1L, 1L) == ">")
>   numF <- length(descriptions)
>   if (numF == 0)
>     stop("no FASTA sequences found")
>   dp <- descriptions + 1L
>   dm <- descriptions - 1L
>   end <- c(dm[-1], length(s1))
>   lapply(seq_len(numF), function(i) {
>     desc <- s1[descriptions[i]]
>     if (strip.desc)
>       desc <- substr(desc, 2L, nchar(desc))
>     seq <- paste(s1[dp[i]:end[i]], collapse = "")
>     list(desc = desc, seq = seq)
>   })
> }
>
> --------
>
> readQualityFASTA <-
>   function (file, checkComments = TRUE, strip.desc = TRUE)
> {
>   if (missing(strip.desc))
>     warning("use 'strip.desc=FALSE' for compatibility with old version\n",
>             "  of readFASTA() or 'strip.desc=TRUE' to remove the \">\"\n",
>             "  marking the beginning of the description lines and get\n",
>             "  rid of this warning (see '?readFASTA' for more details)")
>   if (is.character(file)) {
>     file <- file(file, "r")
>     on.exit(close(file))
>   }
>   else {
>     if (!inherits(file, "connection"))
>       stop("'file' must be a character string or connection")
>     if (!isOpen(file)) {
>       open(file, "r")
>       on.exit(close(file))
>     }
>   }
>   s1 <- scan(file = file, what = "", sep = "\n", quote = "",
>              allowEscapes = FALSE)
>   if (checkComments) {
>     comments <- grep("^;", s1)
>     if (length(comments) > 0)
>       s1 <- s1[-comments]
>   }
>   descriptions <- which(substr(s1, 1L, 1L) == ">")
>   numF <- length(descriptions)
>   if (numF == 0)
>     stop("no FASTA sequences found")
>   dp <- descriptions + 1L
>   dm <- descriptions - 1L
>   end <- c(dm[-1], length(s1))
>   lapply(seq_len(numF), function(i) {
>     desc <- s1[descriptions[i]]
>     if (strip.desc)
>       desc <- substr(desc, 2L, nchar(desc))
>     seq <- paste(s1[dp[i]:end[i]], collapse = " ")
>     seq <- as.numeric(strsplit(seq, "\\s+", perl=TRUE)[[1]])
>     list(desc = desc, seq = seq)
>   })
> }
>
> --------
>
>
>
>> Michael
>>
>>>
>>>
>>> How come I need to specify a directory and a pattern when what I
>>> really want to do is just to specify a file (because that is what I
>>> have)?
>>>
>>>
>>> Thanks for any help.
>>>
>>> All the best,
>>> Dan.
>>>
>>>
>>>
>>> > The basic touch points of the qa report for read (i.e., not aligned)
>>> > data are numbers of reads, nucleotide frequencies
>>> >
>>> >  alphabetFrequency(sread(srq), baseOnly=TRUE, collapse=TRUE)
>>> >
>>> > and cycle-specific alphabet frequencies and average quality scores
>>> > (use alphabetByCycle on sread(srq) and quality(srq)). For 454 it seems
>>> > like a simple plot of average quality score, along the lines of
>>> > alphabetScore(quality(srq)) / width(quality(srq)) against
>>> > width(quality(srq)) can also be quite insightful. There might be
>>> > issues where the functions expect / it makes sense to do analysis on
>>> > uniform-width reads, or on groups of uniformly-widthed reads.
>>> >
>>> > Sorry for the only limited help.
>>> >
>>> > Martin
>>> >
>>> >
>>> >> Thanks for any help,
>>> >> Dan.
>>> >>
>>> >> _______________________________________________
>>> >> Bioc-sig-sequencing mailing list
>>> >> Bioc-sig-sequencing at r-project.org
>>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> >
>>> > --
>>> > Martin Morgan
>>> > Computational Biology / Fred Hutchinson Cancer Research Center
>>> > 1100 Fairview Ave. N.
>>> > PO Box 19024 Seattle, WA 98109
>>> >
>>> > Location: Arnold Building M1 B861
>>> > Phone: (206) 667-2793
>>> >
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list