[Bioc-sig-seq] Standard 454 quality checks?

Mon May 4 14:25:05 CEST 2009

2009/5/4 Michael Lawrence <mflawren at fhcrc.org>:
>
>
> On Mon, May 4, 2009 at 2:19 AM, Dan Bolser <dan.bolser at gmail.com> wrote:
>>
>> 2009/4/23 Martin Morgan <mtmorgan at fhcrc.org>:
>> > Dan Bolser <dan.bolser at gmail.com> writes:
>> >
>> >> Sorry for the noob question, but is there a set of standard quality
>> >> checks in R that I can run over some 454 data? I have the fasta and
>> >> the fasta format quality files as well as an sff. I scanned the manual
>> >> for the ShortReads package, but it seems focused on Illumina, I
>> >> couldn't pick out the general bits from the specifics.
>> >
>> > Hi Dan --
>> >
>> > You might be on somewhat uncharted territory here; most of our
>> > experience (though we have some 454 data now) is with Solexa.
>> >
>> > I don't think the standard QA pipeline, along the lines of
>> > report(qa(<...>)), will work at the moment, but I'll try to add that
>> > today.
>> >
>> > You should be able to read the fasta and quality scores with
>> > read454(). This returns a 'ShortReadQ' object, srq, that bundles the
>> > reads, their quality scores, and their ids.
>>
>> Hi Martin, thanks for the info on this. I'm have only just got round
>> to looking at this but I'm a little bit confused about how to read in
>> the reads / qualities.
>>
>> I had expected to say something like:
>>
>> x <- read454(fasta = "my.fas", qual.fasta = "my.qual")
>>
>>
>> or
>>
>> x <- read454(srf = "my.srf")
>>
>>
>> or
>>
>> x <- read454(fastq = "my.fastq")
>>
>>
>>
>> However, this is clearly not correct. "?read454" brings up the
>> "RochePath-class" page that suggests I try something like:
>>
>> x <- RochePath(readPath="./")
>> y<- read454(x)
>>
>>
>> But that fails too... "no input files found / pattern: \.fna$". I
>> tried to set pattern somewhere (to match the ".fas" / and ".qual"
>> patterns of my files), but I couldn't seem to set them anywhere.
>>
>> Can you help with an example of what I am doing wrong?
>
>
> Thanks for this great feedback. One way to achieve what you want would be to
> call readFasta and readQual separately for each file, and then combine the
> results into a ShortReadQ. In the devel version of ShortRead (it might take
> a day to propagate), I just renamed read454 to readFastaQual and added
> separate pattern arguments for the fasta and qual file, so you could do:
> srq <- readFastaQual(".", "my.fas", "my.qual")
>
> How does that sound? I think that this interface could be made more

The above style of invocation sounds great, but I don't see why its not simply,

srq <- readFastaQual("./my.fas", "./my.qual")


> convenient (such as having separate pattern arguments for the extensions and
> then a "base" pattern), but I'm just trying to stay consistent with what is
> ShortRead already.


Its good to be consistent, but I don't see any other R function that
works like this... i.e. when I read.table I don't have to think about
'patterns' of any kind. I realise that this can be convenient when you
have the experimental output directory, but I think its more often
(and more flexible) to just pass the file names directly like any
other R function.

Anyway, as I said originally, I'm very new to ShortRead and BioC in
general, so it could be that I'm just not used to this style of
thinking.

Thanks very much for your help developing these tools.

Since you mentioned that currently things are not that finalised with
regard to 454 checks, I'll include the code that I have been using
below, in the hope that it could be of some small use to you.


All the best,
Dan.



source("~/BiO/Lang/R/readFASTA.R")
source("~/BiO/Lang/R/readQualityFASTA.R")



if(!interactive()){
  fasta <-
    readFASTA("Data/FUBJ29X06.fas")
}

head(fasta,2)
length(fasta)



## Recreate the histogram provided by the sequencing center

## I just like this code ;-)
##plot(table(sapply(fasta, function(l){nchar(l[[2]])})))

## Do the above, but in a way that saves compute time later...

read.lengths <-
  sapply(fasta, function(l){nchar(l[[2]])})

length.table <-
  table(read.lengths)
class(length.table)

plot(as.integer(length.table),
     main="Frequency plot of the read length",
     xlab="Read length",
     ylab="Frequency")



## How many reads does the read with the most number of reads have
(nx <- max(length.table ))

## OK, but which length is it?
(lx <- as.integer(names(length.table[length.table == nx])))

## To be absolutely clear:
nx # number of reads
lx # of this length.







## Now do something with the quality scores
if(!interactive()){
  fasta.qual <-
    readQualityFASTA("Data/FUBJ29X06.qual")
}

head(fasta.qual,2)
length(fasta.qual)



## Plot one example 'quality profile'...
plot(fasta.qual[[6]][[2]], type="l", lwd=2,
     main="Quality score for a read",
     xlab="Base position",
     ylab="Quality score"
     )



## Plot the mean quality score across all the nx reads of length lx
## (see above).

## Get the indexes of the reads of this length
I <-
  which(read.lengths == lx)
length(I)

head(I)
tail(I)

## Just checking ;-)
stopifnot(length(I) == nx)



## Read the qualities of these reads into a matrix

m <- matrix( 0, nrow=nx, ncol=lx )

m[1:6,1:6]

for(i in 1:nx){
  ## Double check the index is giving us the reads we expect
  stopifnot(length( fasta.qual[[ I[i] ]][[2]] ) == lx)

  m[i,] = fasta.qual[[ I[i] ]][[2]]
}

m[1:6,1:6]



## Now we can easily calculate the mean quality over the set of reads

plot(apply(m, 2, mean), type='l', lwd=2,
     ylim=c(0,50),
     main=paste("Mean quality score for the", nx, "reads of length", lx, "bp"),
     xlab="Base position",
     ylab="Mean quality score",
     )

lines(apply(m, 2, mean) +
      apply(m, 2,   sd), type='l',
      ylim=c(0,40), col=2)

lines(apply(m, 2, mean) -
      apply(m, 2,   sd), type='l',
      ylim=c(0,40), col=2)

abline(h=20, col=2, lty=2)

legend(x="topright", inset=0.05, bg="white",
       leg=expression("" %+-% 1 * "s.d."), col=2, lty=1)





## Plot some overall quality score statistics

## Reminder
class(fasta.qual)
class(fasta.qual[[1]])
class(fasta.qual[[1]][[2]])

base.qualities <-
  unlist(lapply(fasta.qual, function(l){l[[2]]}))

length(base.qualities)

## Sanity check
stopifnot(sum(read.lengths) == length(base.qualities))



if(!interactive()){
  ecdf.base.qualities <-
    ecdf(base.qualities)
}

plot(ecdf.base.qualities, verticals = TRUE,
     main="Empirical Cumulative Distribution Function",
     xlab="Quality score",
     ylab="Fraction of bases")

abline(v=20, col=2, lty=2)
abline(h=ecdf.base.qualities(20), col=2, lty=2)

abline(v=35, col=4, lty=4)
abline(h=ecdf.base.qualities(35), col=4, lty=4)



read.qualities <-
  sapply(fasta.qual, function(l){mean(l[[2]])})

hist(read.qualities,
     main="Histogram of mean read quality score",
     xlab="Mean quality score of a read",
     ylab="Number of reads")

## Too dense
#rug(read.qualities)





## Plot mean quality against length

plot(x=read.lengths,
     y=read.qualities)

cor(read.lengths, read.qualities)
cor.test(read.lengths, read.qualities)
cor.test(read.lengths, read.qualities, method="pearson")

(my.lm <- lm(read.qualities~read.lengths))

summary(my.lm)

abline(coef(my.lm), col=2)


## Test that I understand the slope properly

(tx1 <- 100)
(ty1 <- (2.683e+01 + tx1 * 8.974e-03))
points(tx1, ty1, col=3, pch=4, cex=2, lwd=3)

(tx2 <- 400)
(ty2 <- (2.683e+01 + tx2 * 8.974e-03))
points(tx2, ty2, col=3, pch=4, cex=2, lwd=3)





## Lets filter that by read length...

read.length.filter <- read.lengths > 200


## But look at this
(xx <- sum(read.lengths))
(yy <- sum(read.lengths[read.length.filter]))

round((xx - yy) / xx * 100, 1)


base.qualities.filtered <-
  unlist(lapply(fasta.qual[read.length.filter], function(l){l[[2]]}))

length(base.qualities) == xx
length(base.qualities.filtered) == yy

my.breaks <-
  c( 0,
    2,  4,  6,  8, 10,
    12, 14, 16, 18, 20,
    22, 24, 26, 28, 30,
    32, 34, 36, 38, 40)

q.t <- table(base.qualities)
q.f.t <- table(base.qualities.filtered)

length(q.t)
length(q.f.t)



barplot(rbind(q.t, q.f.t), beside=TRUE,
        main="Distribution of base qualities",
        xlab="Quality score",
        ylab="Number of bases")

legend(x="topleft", inset=0.05,
       fill=c(grey(0.4), grey(0.9)),
       leg=c(
         "full set",
         "length filtered")
       )

--------

--- /local/Scratch/dbolser/BiO/Lang/R/readFASTA.R       2008-08-26
12:42:39.000000000 +0100
+++ /local/Scratch/dbolser/BiO/Lang/R/readQualityFASTA.R
2008-08-26 13:11:34.000000000 +0100
@@ -1,5 +1,5 @@

-readFASTA <-
+readQualityFASTA <-
   function (file, checkComments = TRUE, strip.desc = TRUE)
 {
   if (missing(strip.desc))
@@ -37,7 +37,8 @@
     desc <- s1[descriptions[i]]
     if (strip.desc)
       desc <- substr(desc, 2L, nchar(desc))
-    seq <- paste(s1[dp[i]:end[i]], collapse = "")
+    seq <- paste(s1[dp[i]:end[i]], collapse = " ")
+    seq <- as.numeric(strsplit(seq, "\\s+", perl=TRUE)[[1]])
     list(desc = desc, seq = seq)
   })
 }

--------

readFASTA <-
  function (file, checkComments = TRUE, strip.desc = TRUE)
{
  if (missing(strip.desc))
    warning("use 'strip.desc=FALSE' for compatibility with old version\n",
            "  of readFASTA() or 'strip.desc=TRUE' to remove the \">\"\n",
            "  marking the beginning of the description lines and get\n",
            "  rid of this warning (see '?readFASTA' for more details)")
  if (is.character(file)) {
    file <- file(file, "r")
    on.exit(close(file))
  }
  else {
    if (!inherits(file, "connection"))
      stop("'file' must be a character string or connection")
    if (!isOpen(file)) {
      open(file, "r")
      on.exit(close(file))
    }
  }
  s1 <- scan(file = file, what = "", sep = "\n", quote = "",
             allowEscapes = FALSE)
  if (checkComments) {
    comments <- grep("^;", s1)
    if (length(comments) > 0)
      s1 <- s1[-comments]
  }
  descriptions <- which(substr(s1, 1L, 1L) == ">")
  numF <- length(descriptions)
  if (numF == 0)
    stop("no FASTA sequences found")
  dp <- descriptions + 1L
  dm <- descriptions - 1L
  end <- c(dm[-1], length(s1))
  lapply(seq_len(numF), function(i) {
    desc <- s1[descriptions[i]]
    if (strip.desc)
      desc <- substr(desc, 2L, nchar(desc))
    seq <- paste(s1[dp[i]:end[i]], collapse = "")
    list(desc = desc, seq = seq)
  })
}

--------

readQualityFASTA <-
  function (file, checkComments = TRUE, strip.desc = TRUE)
{
  if (missing(strip.desc))
    warning("use 'strip.desc=FALSE' for compatibility with old version\n",
            "  of readFASTA() or 'strip.desc=TRUE' to remove the \">\"\n",
            "  marking the beginning of the description lines and get\n",
            "  rid of this warning (see '?readFASTA' for more details)")
  if (is.character(file)) {
    file <- file(file, "r")
    on.exit(close(file))
  }
  else {
    if (!inherits(file, "connection"))
      stop("'file' must be a character string or connection")
    if (!isOpen(file)) {
      open(file, "r")
      on.exit(close(file))
    }
  }
  s1 <- scan(file = file, what = "", sep = "\n", quote = "",
             allowEscapes = FALSE)
  if (checkComments) {
    comments <- grep("^;", s1)
    if (length(comments) > 0)
      s1 <- s1[-comments]
  }
  descriptions <- which(substr(s1, 1L, 1L) == ">")
  numF <- length(descriptions)
  if (numF == 0)
    stop("no FASTA sequences found")
  dp <- descriptions + 1L
  dm <- descriptions - 1L
  end <- c(dm[-1], length(s1))
  lapply(seq_len(numF), function(i) {
    desc <- s1[descriptions[i]]
    if (strip.desc)
      desc <- substr(desc, 2L, nchar(desc))
    seq <- paste(s1[dp[i]:end[i]], collapse = " ")
    seq <- as.numeric(strsplit(seq, "\\s+", perl=TRUE)[[1]])
    list(desc = desc, seq = seq)
  })
}

--------



> Michael
>
>>
>>
>> How come I need to specify a directory and a pattern when what I
>> really want to do is just to specify a file (because that is what I
>> have)?
>>
>>
>> Thanks for any help.
>>
>> All the best,
>> Dan.
>>
>>
>>
>> > The basic touch points of the qa report for read (i.e., not aligned)
>> > data are numbers of reads, nucleotide frequencies
>> >
>> >  alphabetFrequency(sread(srq), baseOnly=TRUE, collapse=TRUE)
>> >
>> > and cycle-specific alphabet frequencies and average quality scores
>> > (use alphabetByCycle on sread(srq) and quality(srq)). For 454 it seems
>> > like a simple plot of average quality score, along the lines of
>> > alphabetScore(quality(srq)) / width(quality(srq)) against
>> > width(quality(srq)) can also be quite insightful. There might be
>> > issues where the functions expect / it makes sense to do analysis on
>> > uniform-width reads, or on groups of uniformly-widthed reads.
>> >
>> > Sorry for the only limited help.
>> >
>> > Martin
>> >
>> >
>> >> Thanks for any help,
>> >> Dan.
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> Bioc-sig-sequencing at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >
>> > --
>> > Martin Morgan
>> > Computational Biology / Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N.
>> > PO Box 19024 Seattle, WA 98109
>> >
>> > Location: Arnold Building M1 B861
>> > Phone: (206) 667-2793
>> >
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>