[BioC] pileup function in ShortRead package
Martin Morgan
mtmorgan at fhcrc.org
Thu Oct 8 21:33:36 CEST 2009
Hi Nora --
Nora Rieber wrote:
> Hi there,
>
> I'm using the pileup function in the ShortRead package and it seems I
> call it with the wrong arguments, or that there is some kind of bug in
> the function. All my reads are piled up in forward direction only, even
> if they were actually mapped to the backward strand.
>
> Here's a small (nonfunctional since I assume I cannot append data to
> this e-mail) example of how I call the pileup function:
>
> #read in the MAQ alignment file:
> aln_s_7_MCIP4<-readAligned("/home","sequenceALN.txt", type="MAQMapview")
> aln_Pat_subset<-aln_Pat[1:10]
> pos_Patient_subset<-position(aln_Pat_subset)
> str_Patient_subset<-strand(aln_Pat_subset)
> #use only a small subset (all positions are on the same chromosome 1)
>
> subset_pileuptest<-pileup(pos_Patient_subset, 36,
> max(pos_Patient,na.rm=TRUE)+37,dir=str_Patient_subset)
>
> Is there anything wrong with my dir argument? I checked several places
> all over the genome for the complete alignment and my reads all seem to
> be considered on the + strand, even when they are not. Can anybody help?
To get something reproducible, I did
library(ShortRead)
example(readAligned)
aln <- readAligned(ap, "s_2_export.txt", "SolexaExport")
lane <- aln[chromosome(aln) == "chr5.fa"]
mlane <- lane[strand(lane)=="-"]
The first position is
> min(position(mlane))
[1] 6774915
Here's a pileup like the one you provide, with a convoluted way of
finding the first non-zero position in the pileup
> p <- pileup(position(mlane), 50, max(position(mlane)) + 50,
+ strand("-"))
> Rle(p)
'integer' Rle instance of length 140154400 with 25 runs
Lengths: 6774914 50 39858191 50 5640193 50 5762358 50 2355597 50 ...
Values : 0 1 0 1 0 1 0 1 0 1 ...
which is to say that the first non-zero position occurs after 6774914
zeros, or at position 6774915. The read's position is interpreted to
mean 'the position is the leftmost nucleotide of the aligned 50mer'.
Here I add a 'readlength' argument
> p <- pileup(position(mlane), 50, max(position(mlane)) + 50,
+ strand("-"), readlength=35)
> Rle(p)
'integer' Rle instance of length 140154400 with 25 runs
Lengths: 6774899 50 39858191 50 5640193 50 5762358 50 2355597 50 ...
Values : 0 1 0 1 0 1 0 1 0 1 ...
and the 'position' is interpreted in the same way -- the leftmost
position of the aligned _read_. The fragment has been extended so the
read is a total of 50 nt long, and the extension has been in the
appropriate (from 5' to 3' on the minus strand, decrementing the number
of zeros before the first pileup).
I introduced the Rle above in preparation for suggesting use of
'coverage' (see ?"AlignedRead-class") as an alternative to pileup; it's
a much more compact representation and ties in better with some of the
down-stream tools.
> coverage(mlane, start=1L, extend=15L)[[1]]
'integer' Rle instance of length 140154384 with 24 runs
Lengths: 6774899 50 39858191 50 5640193 50 5762358 50 2355597 50 ...
Values : 0 1 0 1 0 1 0 1 0 1 ...
'coverage' can also be tricky to get correct.
Martin
>
> This is my session info:
>
>> sessionInfo()
> R version 2.9.2 (2009-08-24)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=de_DE.UTF-8;LC_NUMERIC=C;LC_TIME=de_DE.UTF-8;LC_COLLATE=de_DE.UTF-8;LC_MONETARY=C;LC_MESSAGES=de_DE.UTF-8;LC_PAPER=de_DE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_DE.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3 Biostrings_2.12.9
> [5] IRanges_1.2.3
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.4.1 grid_2.9.2 hwriter_1.1
>
> Thanks,
> Nora
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list