[BioC] creating GRanges and reading BAM files
Hervé Pagès
hpages at fhcrc.org
Thu Mar 8 21:11:36 CET 2012
Hi Michael, Dave,
On 03/07/2012 09:23 AM, Michael Lawrence wrote:
> Herve, do you think we could support the P in the CIGAR strings? Or does it
> complicate things too much?
According to the SAM spec:
P: padding (silent deletion from padded reference)
Not clear to me what they mean exactly. In particular, what's
the "padded reference"?
So I looked at the example provided in section 1.1 of the spec
(please use a fixed-width font to display this message):
Coor 12345678901234 5678901234567890123456789012345
ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
+r001/1 TTAGATAAAGGATA*CTG
+r002 aaaAGATAA*GGATA
+r003 gcctaAGCTAA
+r004 ATAGCT..............TCAGC
-r003 ttagctTAGGC
-r001/2 CAGCGCCAT
The corresponding SAM format is:
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
As you can see, P is used in the cigar of read r002.
So yes the reference sequence is padded with 2 consecutive * but
that's just a displaying trick here that makes it easier to display
the 2-letter insertion in r001/1.
So I'm confused about how relevant the 1P operation in the r002
cigar is. The same cigar but without the 1P operation would be
3S6M1I4M, and it tells me all I need to know about how the read
is aligned to the reference (not to the *padded* reference, why
should I care?).
Am I missing something or are there situations where the P
actually carries more interesting/important meaning? Otherwise,
adapting the cigar utils in GenomicRanges will be easy: they'll
just ignore the P's :-)
But I'd really like to have a closer look at those BAM files
produced by the Roche's software first. Dave do you think you
can put this file (the one with P's), or a subset of it,
somewhere online where I can access it?
Thanks,
H.
>
> On Wed, Mar 7, 2012 at 9:03 AM, David A.<dasolexa at hotmail.com> wrote:
>
>> Thanks a lot Michael.
>>
>> I will modify my pseudo-bed file accordingly.
>>
>>
> If you do actually get your BED file into spec, then you could use
> rtracklayer::import to load it directly as a GRanges. You'll still need to
> set the seqlengths on it.
>
>
>> Regarding padding not being supported, and while waiting to see if it ever
>> gets support so that I can continue with analysis, is there any tool that
>> will automatically unpad the alignment? or do people create their own
>> scripts to remove the P flags? That is all it needs right?
>>
>>
>>
>> Dave
>>
>> ------------------------------
>> Date: Wed, 7 Mar 2012 07:21:31 -0800
>> Subject: Re: [BioC] creating GRanges and reading BAM files
>> From: lawrence.michael at gene.com
>> To: dasolexa at hotmail.com
>> CC: bioconductor at r-project.org
>>
>>
>>
>>
>> On Wed, Mar 7, 2012 at 4:51 AM, David A.<dasolexa at hotmail.com> wrote:
>>
>>
>> Hi list,
>>
>> sorry for a simple question, but I am a newbie a bit lost reading all the
>> information on how to handle NGS data using R tools. I have a set of BAM
>> files from Junior sequencer, one BAM per amplicon per sample (Roche's
>> software does not output one single BAM file per sample including all
>> amplicons). I also have a bed file with the features sequenced. At the
>> moment I am only testing with two amplicons for one sample. My final aim is
>> to get the coverage per amplicon per sample
>>
>> I am using R version 2.14.2 (2012-02-29)
>>
>> I read in the bed file and tried to create a GRanges object:
>>
>>> bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE)
>>> bed.frame
>> chromosome start end
>> 1 APOBex26_M13 1 478
>> 2 APOBex29_M13 1 448
>>
>>
>> This does not appear to be a valid BED file. There should not be any
>> column names, and the intervals should be 0-based. But anyway, let's just
>> say it is BED-like.
>>
>>
>>>
>> gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end=bed.frame[,3]))
>>> gr
>> GRanges with 2 ranges and 0 elementMetadata values:
>> seqnames ranges strand
>> <Rle> <IRanges> <Rle>
>> [1] APOBex26_M13 [1, 478] *
>> [2] APOBex29_M13 [1, 448] *
>> ---
>> seqlengths:
>> APOBex26_M13 APOBex29_M13
>> NA NA
>>
>>
>> I also read in the list of BAM files vand convert to BamFileList:
>>
>>> fls = list.files(pattern="*bam",full=TRUE)
>>> fls
>> [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam"
>> [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam"
>> bfs<- BamFileList(fls)
>>
>> I then try to summarizeOverlaps but gives me the following error.
>>
>>> olap<-summarizeOverlaps(gr, bfs)
>> Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") :
>> in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported
>> yet, sorry!
>>
>> What is the meaning of this error and how can I overcome it?
>>
>>
>> Well I would just take it at face value: the P cigar operation is not
>> supported yet. Maybe it is time to start supporting it.. :)
>>
>>
>> Is the gr object not created properly? Is Metadata necessary? Why is not
>> seqlengths filled automatically using the IRanges?
>>
>>
>> There is no way to know the seqlengths automatically. You've constructed a
>> GRanges with features that lay on some typically longer sequence, but the
>> sequence length was never specified. You can pass your "end" values as the
>> "seqlengths" to the GRanges constructor.
>>
>> Michael
>>
>>
>> Thanks for your help,
>>
>> Dave
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list