[Bioc-sig-seq] Rsamtools readPileup with Mosaik alignments

Wed Apr 21 17:40:10 CEST 2010

And for the record:

 > sessionInfo()
R version 2.11.0 Under development (unstable) (2010-03-23 r51373)
x86_64-unknown-linux-gnu

locale:
  [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
  [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
  [5] LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915
  [7] LC_PAPER=en_US.iso885915       LC_NAME=C
  [9] LC_ADDRESS=C                   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Rsamtools_0.2.7      Biostrings_2.15.25   GenomicRanges_0.1.16
[4] IRanges_1.5.76

loaded via a namespace (and not attached):
[1] Biobase_2.7.5
 >

James MacDonald wrote:
> Hi,
> 
> I have found what I think is an infelicity with readPileup() when the 
> alignment is done with software other than say BWA or Bowtie.
> 
> I am aligning SOLiD data, so my choices are rather limited. In order to 
> do gapped alignments, I used Mosaik to do the alignment and then 
> converted to bam files and proceeded from there using samtools.
> 
> When I use the readPileup() function, if I choose variant="SNP", it 
> appears I am not reading in all the SNPs, and if I choose 
> variant="indel", it appears I am reading in SNPs as well as indels.
> 
> I think this arises because of an expectation for the pileup format that 
> is outlined on this page:
> 
> http://samtools.sourceforge.net/cns0.shtml 
> 
> where an indel should look like this, in the pileup output:
> 
> seq2  156 A  A  10  0  99  11  .$......+2AG.+2AG.+2AGGG <975;:<<<<<
> seq2  156 *  +AG/+AG  71  252  99  11  +AG  *  3  8  0
> 
> In this case, both lines indicate an indel at the same position, so 
> readPileup() with variant="SNP" will search for lines containing "*" in 
> the third column, and remove a given line and the line preceding it. So 
> in this case, both of these lines would be removed, as expected.
> 
> If variant="indel", then data from both lines are used, which is OK for 
> the data above, but not so much for me.
> 
> This is because the output from pileup that I get when using the Mosaik 
> aligned data doesn't follow that convention. Instead it looks like this:
> 
> chr10 50620982 G A 45 72 31 16 AAAAaaaaAaaaAAac  =<><@=9>@'7:@@:&
> chr10 50623836 G A 60 63 30 19 AA.CAAAAAAAAAaAAAAA <9/::@<@A?:A@>4=:AB
> chr10 50650827 * */-cca 162 162 22 26 * -cca 24 2 0 0 0
> chr10 50650919 * +A/* 182 182 20 27 +A * 5 20 2 0 0
> chr10 50651118 * */-t 72 72 29 51 * -t 48 3 0 0 0
> chr10 50651155 * */-t 145 145 19 21 * -t 13 4 4 0 0
> chr10 50651269 C Y 46 53 42 16 ,tT,,,,TtttTt,,, @A9;<<7<@>8>;0,,
> 
> So when I read these data using readPileup(<thefile>, variant="SNP"), I 
> lose the SNP in the second line. If I read in indels using 
> variant="indel", then the SNP in the second line will be read in.
> 
> I don't know if there are also instances where an indel takes two lines 
> as in the samtools example above, but that would be the minority if so.
> 
> Best,
> 
> Jim
> 
> 
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues