[BioC] readGappedAlignmentPairs with multimapping reads
Hervé Pagès
hpages at fhcrc.org
Thu Jun 7 01:33:58 CEST 2012
Hi Vedran,
On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote:
>
> How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome?
Good question and I wish there was a simple answer...
readGappedAlignmentPairs() delegates to findMateAlignment() for doing
the pairing. findMateAlignment() does not have a full man page yet
but will have one soon. The man page will explain the algorithm used
for doing the pairing of records loaded from a BAM file.
Here is roughly how it works.
First only records with flag bit 0x1 set to 1, flag bit 0x4 set to 0,
and flag bit 0x8 set to 0 are candidates for pairing (see the SAM
Spec for a description of flag bits and fields). Any other record
is discarded. That is, records that correspond to single end reads,
and records that correspond to paired end reads where one or both
ends are unmapped, are discarded.
Then the algorithm looks at the following fields and flag bits:
(A) QNAME
(B) RNAME, RNEXT
(C) POS, PNEXT
(D) Flag bits Ox10 and 0x20
(E) Flag bits 0x40 and 0x80
2 records rec(i) and rec(j) are considered mates iff all the following
conditions are satisfied:
(A) They have the same QNAME
(B) RNEXT(i) == RNAME(j) and RNEXT(j) == RNAME(i)
(C) PNEXT(i) == POS(j) and PNEXT(j) == POS(i)
(D) Flag bit 0x20 of rec(i) == Flag bit 0x10 of rec(j) and
Flag bit 0x20 of rec(j) == Flag bit 0x10 of rec(i)
(E) rec(i) corresponds to the first segment in the template and
rec(j) corresponds to the last segment in the template
OR
rec(j) corresponds to the first segment in the template and
rec(i) corresponds to the last segment in the template
This algorithm will find almost all pairs unambiguously, even when
the same pair of reads maps to several places in the genome. Note
that when a given pair maps to a single place in the genome, looking
at (A) is enough to pair the 2 corresponding records. The additional
conditions (B), (C), (D) and (E) are only here to help in the situation
where more than 2 records share the same QNAME. And that works most
of the times but there are still situations where this is not enough
to solve the pairing problem unambiguously.
For example, here are 4 records (loaded in a GappedAlignments object)
that cannot be paired with the above algorithm:
** Showing the 4 records as a GappedAlignments object of length 4:
GappedAlignments with 4 alignments and 2 elementMetadata cols:
seqnames strand cigar qwidth start
end
<Rle> <Rle> <character> <integer> <integer>
<integer>
SRR031714.2658602 chr2R + 21M384N16M 37 6983850
6984270
SRR031714.2658602 chr2R + 21M384N16M 37 6983850
6984270
SRR031714.2658602 chr2R - 13M372N24M 37 6983858
6984266
SRR031714.2658602 chr2R - 13M378N24M 37 6983858
6984272
width ngap | mrnm mpos
<integer> <integer> | <factor> <integer>
SRR031714.2658602 421 1 | chr2R 6983858
SRR031714.2658602 421 1 | chr2R 6983858
SRR031714.2658602 409 1 | chr2R 6983850
SRR031714.2658602 415 1 | chr2R 6983850
Note that the BAM fields show up in the following columns:
- QNAME: the names of the GappedAlignments object (unnamed col)
- RNAME: the seqnames col
- POS: the start col
- RNEXT: the mrnm col
- PNEXT: the mpos col
As you can see, the aligner has aligned the same pair to the same
location twice! The only difference between the 2 aligned pairs is in
the cigar i.e. one end of the pair is aligned twice to the same location
with exactly the same cigar while the other end of the pair is aligned
twice to the same location but with slightly different cigars.
** Now showing the corresponding flag bits:
isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand
[1,] 1 1 0 0 0
[2,] 1 1 0 0 0
[3,] 1 1 0 0 1
[4,] 1 1 0 0 1
isMateMinusStrand isFirstMateRead isSecondMateRead isNotPrimaryRead
[1,] 1 0 1 0
[2,] 1 0 1 0
[3,] 0 1 0 0
[4,] 0 1 0 0
isNotPassingQualityControls isDuplicate
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 0 0
As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4)
are both first mates. But looking at (A), (B), (C), (D) and (E), the
pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could
be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to
disambiguate! Also note that everything is tagged as proper pair (flag
bit 0x2 is set to 1) and primary read (flag bit 0x100 is set to 0), so
using this information would not help disambiguate here.
I'm wondering if there is some other place we should look at in the
BAM file in order to disambiguate (e.g. a tag?), or if those
ambiguous pairings are just part of the life with the SAM Spec.
Not sure whether this is a weakness of the Spec? Or A feature? Any
input on this would be appreciated.
In the meantime, findMateAlignment() is just ignoring those ambiguous
pairings (with a warning) i.e. records that cannot be paired
unambiguously are not paired at all. Concretely that means
that readGappedAlignmentPairs() is guaranteed to return a
GappedAlignmentPairs object where every pair could be formed in an
non-ambiguous way. Note that AFAICS in practice this approach
doesn't seem to leave aside a lot of records because ambiguous
pairing events seem pretty rare.
Cheers,
H.
>
> Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled?
>
> Thank you in advance!
>
> -- output of sessionInfo():
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
> [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.0 tools_2.15.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list