[BioC] creating GRanges and reading BAM files

Thu Mar 8 21:11:36 CET 2012

Hi Michael, Dave,

On 03/07/2012 09:23 AM, Michael Lawrence wrote:
> Herve, do you think we could support the P in the CIGAR strings? Or does it
> complicate things too much?

According to the SAM spec:

   P: padding (silent deletion from padded reference)

Not clear to me what they mean exactly. In particular, what's
the "padded reference"?

So I looked at the example provided in section 1.1 of the spec
(please use a fixed-width font to display this message):

     Coor    12345678901234  5678901234567890123456789012345
     ref     AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
     +r001/1       TTAGATAAAGGATA*CTG
     +r002        aaaAGATAA*GGATA
     +r003      gcctaAGCTAA
     +r004                    ATAGCT..............TCAGC
     -r003                           ttagctTAGGC
     -r001/2                                       CAGCGCCAT

   The corresponding SAM format is:

   @HD VN:1.3 SO:coordinate
   @SQ SN:ref LN:45
   r001 163 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
   r002   0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
   r003   0 ref  9 30 5H6M       *  0   0 AGCTAA   *   NM:i:1
   r004   0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
   r003  16 ref 29 30 6H5M       *  0   0 TAGGC    *   NM:i:0
   r001  83 ref 37 30 9M         =  7 -39 CAGCGCCAT         *

As you can see, P is used in the cigar of read r002.

So yes the reference sequence is padded with 2 consecutive * but
that's just a displaying trick here that makes it easier to display
the 2-letter insertion in r001/1.

So I'm confused about how relevant the 1P operation in the r002
cigar is. The same cigar but without the 1P operation would be
3S6M1I4M, and it tells me all I need to know about how the read
is aligned to the reference (not to the *padded* reference, why
should I care?).

Am I missing something or are there situations where the P
actually carries more interesting/important meaning? Otherwise,
adapting the cigar utils in GenomicRanges will be easy: they'll
just ignore the P's :-)

But I'd really like to have a closer look at those BAM files
produced by the Roche's software first. Dave do you think you
can put this file (the one with P's), or a subset of it,
somewhere online where I can access it?

Thanks,
H.

>
> On Wed, Mar 7, 2012 at 9:03 AM, David A.<dasolexa at hotmail.com>  wrote:
>
>>   Thanks a lot Michael.
>>
>> I will modify my pseudo-bed file accordingly.
>>
>>
> If you do actually get your BED file into spec, then you could use
> rtracklayer::import to load it directly as a GRanges. You'll still need to
> set the seqlengths on it.
>
>
>> Regarding padding not being supported, and while waiting to see if it ever
>> gets support so that I can continue with analysis, is there any tool that
>> will automatically unpad the alignment? or do people create their own
>> scripts to remove the P flags? That is all it needs right?
>>
>>
>>
>> Dave
>>
>> ------------------------------
>> Date: Wed, 7 Mar 2012 07:21:31 -0800
>> Subject: Re: [BioC] creating GRanges and reading BAM files
>> From: lawrence.michael at gene.com
>> To: dasolexa at hotmail.com
>> CC: bioconductor at r-project.org
>>
>>
>>
>>
>> On Wed, Mar 7, 2012 at 4:51 AM, David A.<dasolexa at hotmail.com>  wrote:
>>
>>
>> Hi list,
>>
>> sorry for a simple question, but I am a newbie a bit lost reading all the
>> information on how to handle NGS data using R tools. I have a set of BAM
>> files from Junior sequencer, one BAM per amplicon per sample (Roche's
>> software does not output one single BAM file per sample including all
>> amplicons). I also have a bed file with the features sequenced. At the
>> moment I am only testing with two amplicons for one sample. My final aim is
>> to get the coverage per amplicon per sample
>>
>> I am using R version 2.14.2 (2012-02-29)
>>
>> I read in the bed file and tried to create a GRanges object:
>>
>>> bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE)
>>> bed.frame
>>      chromosome start end
>> 1 APOBex26_M13      1 478
>> 2 APOBex29_M13      1 448
>>
>>
>> This does not appear to be a valid BED file. There should not be any
>> column names, and the intervals should be 0-based. But anyway, let's just
>> say it is BED-like.
>>
>>
>>>
>> gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end=bed.frame[,3]))
>>> gr
>> GRanges with 2 ranges and 0 elementMetadata values:
>>            seqnames    ranges strand
>>               <Rle>  <IRanges>   <Rle>
>>   [1] APOBex26_M13   [1, 478]      *
>>   [2] APOBex29_M13   [1, 448]      *
>>   ---
>>   seqlengths:
>>    APOBex26_M13 APOBex29_M13
>>               NA            NA
>>
>>
>> I also read in the list of BAM files vand convert to BamFileList:
>>
>>> fls = list.files(pattern="*bam",full=TRUE)
>>> fls
>>   [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam"
>>   [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam"
>> bfs<- BamFileList(fls)
>>
>> I then try to summarizeOverlaps but gives me the following error.
>>
>>> olap<-summarizeOverlaps(gr, bfs)
>> Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") :
>>   in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported
>> yet, sorry!
>>
>> What is the meaning of this error and how can I overcome it?
>>
>>
>> Well I would just take it at face value: the P cigar operation is not
>> supported yet. Maybe it is time to start supporting it.. :)
>>
>>
>> Is the gr object not created properly? Is Metadata necessary? Why is not
>> seqlengths filled automatically using the IRanges?
>>
>>
>> There is no way to know the seqlengths automatically. You've constructed a
>> GRanges with features that lay on some typically longer sequence, but the
>> sequence length was never specified. You can pass your "end" values as the
>> "seqlengths" to the GRanges constructor.
>>
>> Michael
>>
>>
>> Thanks for your help,
>>
>> Dave
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319